Text classification for Audience review from Rotten Tomato - Web scraping

In [181]:
from IPython.display import display, Image
Image("D:\Data_science\moviereview.png")
Out[181]:

Introduction on Reviews and Moview reviews

Online reviews are important because they have become a reference point for buyers across the globe and because so many people trust them when making purchase decisions.

Reviews are also important for Search Engine Optimization (SEO). Having positive reviews is also another way through which you can improve a website’s Search Engine visibility. The more that people talk about a brand online, the greater its visibility to Search Engines, such as Google, Yahoo and Bing.

For the audience and booking websites, analysing reviews is significant in understanding reviewer opinion about the film.

In movie booking websites, 90% of people first check out online reviews before purchasing tickets.

For the production house, analysing negative reviews can be useful for damage control.

Problem Statement:

image.png

In [191]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\client.png")
Out[191]:
In [189]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\waltdisney.png")
Out[189]:
In [192]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\waltdisneymovies.png")
Out[192]:
In [193]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\disneyfamily.png")
Out[193]:

History about the client - The walt Disney

Acquired studios/Production Houses:

In [203]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\companies.png")
Out[203]:
In [194]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\disneyand20thcen.png")
Out[194]:
In [195]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\disneystorewebsite.png")
Out[195]:
In [196]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\disneyproducts.png")
Out[196]:
In [197]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\someaudiencereviews.png")
Out[197]:
In [199]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\someaudiencereviews2.png")
Out[199]:

FACTS about the client

Disney is well know for remake, Sequel movies and animation.

Very successful in Animation movies

Not so successful in Live-Action movies

Usually Don’t release movie in Dec.

June-July is Jackpot month. They need to decide right combination of movies to release in June-July to meet the expectation and demand

They postpone the release date of few movies depends on the sentiment mix among audience.

Previous release this year

Dumbo - Remake

Aladdin – Remake

Toy story 4- Sequel

EndGame - Avengers

The Lion King - Remake

Artemis Fowl – New – Live action

Upcoming movies this year

Mistress of Evil – Remake – Oct month

Frozen II – Sequel – Nov month

Next year - Upcoming

Feb – Untitled Live action - New

Mar – Onward – Remake

Mar – Mulan – Live action – remake of animated version.

Jun – Monsters – Remake

July – Jungle Cruise – Remake(Initially planned in Oct 2019).

Oct – Untitled Live action - Sequel

History about the Movie

In [72]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\lionkinghistory.png")
Out[72]:

Something New!

Walt Disney started making CGI movies of old cartoon based movies. wanted to know how is the customer responses in their trial so far!!

Business problem statement

Expectation from Client

1. Movie level :

Overall sentiment of Audience about movie.

Frequently commented words to merchandise those words and themes.

When they can plan Sequel of Lion King.

How will the overall sentiment be

What they have to target

Any technical comments – Background music, song, voice.

Any sentiment comments – Violence, fights, pride, etc.

TV license issue decision

2. Industry level :

Sentiment on CGI based movies

Right mix of movies to release.

Should they focus on animation or Live action.

Do they need to reschedule any movie release in upcoming list.

Upcoming movies – how they should advertise.

In [121]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\TVlicensereview.png")
Out[121]:
In [140]:
Image("D:\Data_science\PHD\licensefee.png")
Out[140]:

Business Question from above Images :

Producer company makes decent above 20% of total revenue in TV Telecast license for the movie.

They have to decide when they have to stop movies in theatres and issue TV license.

This is important decision to make.

Analytical Approach to this Project:

1. Data Extraction - Web scrape

  1. Web-scrape the data, save it into a CSV File
  2. Read the data and create Sentiment column from Rating.

2. EDA_visualization_Feature engineering

  1. Show distribution plot for each variable
  2. Use Data attribute and get Weekday and Weekend features.
  3. Show distribution plot for new features.
  4. Draw patterns towards target variable.
  5. Explain insights on each plot and distribution

3. Review Text understanding/cleaning

Cleaning of Text:

  1. Apply Lower Case
  2. Remove special characters
  3. Remove alpha numericals
  4. Remove punctuation marks
  5. Word Count.
  6. Check if words in a review.
  7. Convert Dictionary to DF.
  8. Stop words, stemming, Lemmatize a. Using spacy b. Add customized stop words c. Use NLTK’s stemmer d. Tokenize e. Custom Stop words Removal f. Stemming

  9. Remove stop words from clean text datasets and run models.

  10. Plot the Text again b. Plot with stop words c. Plot frequency (Word Cloud)
  11. TF-IDF

4. Build base model

5. Vectorize the word - Review text

6. Train test split

7. Build models

8. Compare models

9. Predict on unseen data with best model.

10. Clustering of review text for good and bad reviews. Business insight and suggestion.

1. Data Extraction

Lets webscrape Audience review from Rotten Tomato for The Lion King Movie

Loading Libraries

In [20]:
#Loading required Libraries
import requests
import time
import csv
import pandas as pd
import numpy as np

Headers for our request

In [2]:
#Creating headers for our request
headers = {
'Referer': 'https://www.rottentomatoes.com/m/the_lion_king_2019/reviews?type=user',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
}
In [3]:
#API link to webscrape from RottenTomato
url = 'https://www.rottentomatoes.com/napi/movie/9057c2cf-7cab-317f-876f-e50b245ca76e/reviews/user'
In [4]:
#Initial payload parameters to fetch data

payload = {
'direction': 'next',
'endCursor': '',
'startCursor': '',
}
In [5]:
#Creating a Session Object with Rotten Tomato API
sess = requests.Session()
In [6]:
# To fetch one-page reviews by using GET 
r = sess.get(url, headers=headers, params=payload) # GET Call
data = r.json()
In [7]:
#Creating Empty list for Page Info and Audience review to initiate Iteration
page_info = []
Audience_reveiws = []

Extracting 3000 reviews from Rotten Tomato

In [8]:
# To get 6000 reviews, calling GET for 600 times. 

for i in range(600):
    update_start = data.get('pageInfo').get('startCursor')
    update_end = data.get('pageInfo').get('endCursor')
    payload.update({'startCursor':update_start})
    payload.update({'endCursor':update_end})
    r = sess.get(url, headers=headers, params=payload) # GET Call
    data = r.json()
    page_info.append(data.get('pageInfo'))
    Audience_reveiws.append(data.get('reviews'))
    time.sleep(5)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~\Anaconda3\lib\site-packages\simplejson\scanner.py in _scan_once(string, idx)
     36         try:
---> 37             nextchar = string[idx]
     38         except IndexError:

IndexError: string index out of range

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
<ipython-input-8-f94c22e1a3c6> in <module>
      7     payload.update({'endCursor':update_end})
      8     r = sess.get(url, headers=headers, params=payload) # GET Call
----> 9     data = r.json()
     10     page_info.append(data.get('pageInfo'))
     11     Audience_reveiws.append(data.get('reviews'))

~\Anaconda3\lib\site-packages\requests\models.py in json(self, **kwargs)
    895                     # used.
    896                     pass
--> 897         return complexjson.loads(self.text, **kwargs)
    898 
    899     @property

~\Anaconda3\lib\site-packages\simplejson\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, use_decimal, **kw)
    516             parse_constant is None and object_pairs_hook is None
    517             and not use_decimal and not kw):
--> 518         return _default_decoder.decode(s)
    519     if cls is None:
    520         cls = JSONDecoder

~\Anaconda3\lib\site-packages\simplejson\decoder.py in decode(self, s, _w, _PY3)
    368         if _PY3 and isinstance(s, bytes):
    369             s = str(s, self.encoding)
--> 370         obj, end = self.raw_decode(s)
    371         end = _w(s, end).end()
    372         if end != len(s):

~\Anaconda3\lib\site-packages\simplejson\decoder.py in raw_decode(self, s, idx, _w, _PY3)
    398             elif ord0 == 0xef and s[idx:idx + 3] == '\xef\xbb\xbf':
    399                 idx += 3
--> 400         return self.scan_once(s, idx=_w(s, idx).end())

~\Anaconda3\lib\site-packages\simplejson\scanner.py in scan_once(string, idx)
     77             raise JSONDecodeError('Expecting value', string, idx)
     78         try:
---> 79             return _scan_once(string, idx)
     80         finally:
     81             memo.clear()

~\Anaconda3\lib\site-packages\simplejson\scanner.py in _scan_once(string, idx)
     37             nextchar = string[idx]
     38         except IndexError:
---> 39             raise JSONDecodeError(errmsg, string, idx)
     40 
     41         if nextchar == '"':

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
In [9]:
#To view received data
print(Audience_reveiws)
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Assigning column Names for extracted data

In [10]:
# Create empty data frame with Reviewer ID, Reviewer name, Review, Rating, Date_of_Review columns
# Assign column names with respective columns

col_names =  ['ReviewID','Reviewer Name', 'Review', 'Rating' ,'Date_of_Review']
review_data = pd.DataFrame(columns = col_names)
In [11]:
# Add data to the data frame created

# len(Audience_reveiws) corresponds to total of 300 pages
for i in range(0, len(Audience_reveiws)):
    
    #len(Audience_reveiws[0]) corresponds to 10 ratings which will be displayed in single page
    for j in range(0,len(Audience_reveiws[0])):
        review_data = review_data.append({'ReviewID':Audience_reveiws[i][j].get('user').get('userId'),
                                  'Reviewer Name': Audience_reveiws[i][j].get('displayName'), 
                                  'Review': Audience_reveiws[i][j].get('review'), 
                                  'Rating': Audience_reveiws[i][j].get('score') , 
                                  'Date_of_Review':Audience_reveiws[i][j].get('createDate')},
                                 ignore_index=True)
In [12]:
#Check the shape of the review data
review_data.shape
Out[12]:
(5950, 5)
In [13]:
#To view data with head of few lines
review_data.head()
Out[13]:
ReviewID Reviewer Name Review Rating Date_of_Review
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5 2019-08-22T23:52:03.870Z
1 966121979 Stephen M I loved this movie \n 5 2019-08-22T23:50:08.574Z
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4 2019-08-22T23:32:17.995Z
3 c7d41004-8ce5-46f7-ab89-b94d0c634bbe MrsR Brought my kids to the original ages ago. This... 5 2019-08-22T23:30:52.872Z
4 2c88353b-108c-436d-bb5f-6ab4f9bb8641 BRENT I grew up watching this and my kids are now. T... 5 2019-08-22T23:22:03.957Z

Exporting Review data to CSV file. (For Text Processing steps)

In [14]:
# Export data into csv file from the data frame


review_data.to_csv("audience_review.csv", sep=',', columns=['ReviewID','Reviewer Name', 'Review', 'Rating' ,'Date_of_Review'], header=True, index=False)

Creating Sentiment column

In [23]:
#To Create New column "Sentiment" - If Rating is greater than 3, Positive Sentiment. If it is less than or equal to 3, Negative

review_data['Sentiment'] = np.where(review_data['Rating']>3, 'Pos','Neg')
In [24]:
#To view data with head of few lines to verify Sentiment column
review_data.head(3)
Out[24]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22T23:52:03.870Z Pos
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22T23:50:08.574Z Pos
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22T23:32:17.995Z Pos

2. EDA_Visulaization_FeatureEngineering

Lets us do some Explortory Data analysis and Data visualization

Loading required Libraries

In [1]:
#Loading required Libraries 

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


import re
from nltk.corpus import stopwords
from nltk import word_tokenize
STOPWORDS = set(stopwords.words('english'))
from bs4 import BeautifulSoup
import plotly.graph_objs as go
from sklearn.model_selection import train_test_split

from IPython.core.interactiveshell import InteractiveShell
import plotly.figure_factory as ff
InteractiveShell.ast_node_interactivity = 'all'
from plotly.offline import iplot

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D

from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout
Using TensorFlow backend.
C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:516: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:517: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:518: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:519: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:520: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:525: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:541: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:542: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:543: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:544: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:545: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\dtypes.py:550: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

Reading Extracted data (Webscrapped from Rotten Tomato)

In [207]:
#Reading data 
data = pd.read_csv("audience_review.csv")

Data Understanding

In [208]:
#To view data with head of few lines
data.head()
Out[208]:
ReviewID Reviewer Name Review Rating Date_of_Review
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22T23:52:03.870Z
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22T23:50:08.574Z
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22T23:32:17.995Z
3 c7d41004-8ce5-46f7-ab89-b94d0c634bbe MrsR Brought my kids to the original ages ago. This... 5.0 2019-08-22T23:30:52.872Z
4 2c88353b-108c-436d-bb5f-6ab4f9bb8641 BRENT I grew up watching this and my kids are now. T... 5.0 2019-08-22T23:22:03.957Z

Creating Sentiment column

In [209]:
#To Create New column "Sentiment" - If Rating is greater than 3, Positive Sentiment. If it is less than or equal to 3, Negative
data['Sentiment'] = np.where(data['Rating']>3, 'Pos', 'Neg')
In [210]:
#To view data with head of few lines to verify Sentiment column
data.head()
Out[210]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22T23:52:03.870Z Pos
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22T23:50:08.574Z Pos
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22T23:32:17.995Z Pos
3 c7d41004-8ce5-46f7-ab89-b94d0c634bbe MrsR Brought my kids to the original ages ago. This... 5.0 2019-08-22T23:30:52.872Z Pos
4 2c88353b-108c-436d-bb5f-6ab4f9bb8641 BRENT I grew up watching this and my kids are now. T... 5.0 2019-08-22T23:22:03.957Z Pos
In [211]:
data.shape
Out[211]:
(5950, 6)

Check the Level of Positive and Negative reviews

In [212]:
#First data collection - Imbalanced data
data.Sentiment.value_counts().plot(kind='pie', autopct='%1.0f%%')
Out[212]:
<matplotlib.axes._subplots.AxesSubplot at 0x213bbfd0>

Insight on Data collection

Extracted data had imbalance with less value count of Negative review. so, Extracted Data again with 6000 records and will select balance data of Positive and negative levels.

In [213]:
data= pd.DataFrame(data)
In [214]:
Postivie_review= data[data['Sentiment'] == 'Pos'].head(1500)
Negative_review= data[data['Sentiment'] == 'Neg'].head(1500)
In [215]:
Postivie_review.head(3)
Out[215]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22T23:52:03.870Z Pos
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22T23:50:08.574Z Pos
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22T23:32:17.995Z Pos
In [216]:
Negative_review.head(3)
Out[216]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment
7 977536013 Dj P It was an okay adaptation, but the animated mo... 3.0 2019-08-22T23:10:34.796Z Neg
8 f5a7061d-78e6-44b1-b8ab-21d4ed9aa28d Katrina This movie had two choices: have fun and be en... 2.0 2019-08-22T22:56:04.416Z Neg
12 65F20DB8-0210-480B-B30B-F9D3234EC3CC Sandra The acting and it was pretty bad 2.0 2019-08-22T21:38:50.251Z Neg
In [217]:
#Combine both the positive and negative reviews data
combine_data=pd.concat([Postivie_review, Negative_review])
In [218]:
#To check if combined data has Sentiment levels properly (Sentiment level 1)
combine_data.head(3)
Out[218]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22T23:52:03.870Z Pos
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22T23:50:08.574Z Pos
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22T23:32:17.995Z Pos
In [219]:
#To check if combined data has Sentiment levels properly (Sentiment level 0)
combine_data.tail(3)
Out[219]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment
5551 978175094 Adam D It had amazing graphics and cinematography. An... 1.5 2019-07-28T06:45:06.273Z Neg
5552 945088284 NaN Overall mediocre. You already know the story. ... 2.0 2019-07-28T06:44:25.367Z Neg
5553 f28cac31-49ac-470f-b003-2b033b71e7e6 Jamie Great effects, very slow talking. Much like wa... 3.0 2019-07-28T06:42:45.124Z Neg
In [220]:
#Renaming to convienient name
data1=combine_data
In [221]:
#To check shape of the data
data1.shape
Out[221]:
(3000, 6)

Balance Data is ready for Data visualization and Processing

Distribution of few Variables

In [205]:
plot_size = plt.rcParams["figure.figsize"] 
print(plot_size[0]) 
print(plot_size[1])


plot_size[0] = 12
plot_size[1] = 10
plt.rcParams["figure.figsize"] = plot_size 
6.0
4.0

Sentiment distribution plot

In [222]:
# Sentiment distribution plot

Sentiment = data1.Sentiment.value_counts().plot(kind='pie', autopct='%1.0f%%', shadow=True)

Rating distribution plot

In [223]:
#Rating distribution plot
data1.Rating.value_counts().plot(kind='pie', autopct='%1.0f%%', shadow=True)
Out[223]:
<matplotlib.axes._subplots.AxesSubplot at 0x2165be80>
In [224]:
#To check the count of Rating in different level/value
data1.Rating.value_counts()
Out[224]:
5.0    944
3.0    514
4.0    306
2.0    295
2.5    238
0.5    180
1.0    172
3.5    130
4.5    120
1.5    101
Name: Rating, dtype: int64

Insights from Value counts and Percentage of Negative:

  1. Overall Sentiment of the movie is Positive.
  2. 25% of the Audience gave less than 2 star rating
  3. 1/3rd of the Audience gave less than 2.5 star rating.
  4. At the same time, 1/3 rd of Audience gave full rating of 5 stars.
In [225]:
#To check the count of Sentiment in different level/value
data1.Sentiment.value_counts()
Out[225]:
Neg    1500
Pos    1500
Name: Sentiment, dtype: int64
In [226]:
#To create a function to view Text data
def print_plot(index):
    example = data[data.index == index][['Review', 'Sentiment']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Sentiment:', example[1])
In [227]:
#Print to view sample Text data
print_plot(100)
Movie was Good , just really wished they had Ed the hyena laughing just like in the original cartoon version lol
Sentiment: Pos
In [229]:
#Converting Date to appropriate data type
data1.Date_of_Review= pd.to_datetime(data1.Date_of_Review)
In [230]:
#To check Date column changed data type
data1.head()
Out[230]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22 23:52:03.870000+00:00 Pos
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22 23:50:08.574000+00:00 Pos
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22 23:32:17.995000+00:00 Pos
3 c7d41004-8ce5-46f7-ab89-b94d0c634bbe MrsR Brought my kids to the original ages ago. This... 5.0 2019-08-22 23:30:52.872000+00:00 Pos
4 2c88353b-108c-436d-bb5f-6ab4f9bb8641 BRENT I grew up watching this and my kids are now. T... 5.0 2019-08-22 23:22:03.957000+00:00 Pos

Feature Engineering

Lets us bring in some new features to explain about data clearly

Weekday Feature creation

In [231]:
#To create new feature - Weekday 
data1['Weekday']=data1['Date_of_Review'].dt.weekday_name
In [232]:
#To view Weekday column
data1.head(3)
Out[232]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment Weekday
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22 23:52:03.870000+00:00 Pos Thursday
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22 23:50:08.574000+00:00 Pos Thursday
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22 23:32:17.995000+00:00 Pos Thursday

Weekend Feature creation

In [233]:
#To create new feature if the review is created on weekday or weekend
data1['dow'] = data1['Date_of_Review'].apply(lambda x: x.date().weekday())
data1['is_weekend'] = data1['Date_of_Review'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)
In [234]:
#To view Weekend column
data1.head(3)
Out[234]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment Weekday dow is_weekend
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22 23:52:03.870000+00:00 Pos Thursday 3 0
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22 23:50:08.574000+00:00 Pos Thursday 3 0
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22 23:32:17.995000+00:00 Pos Thursday 3 0
In [235]:
#To view count of reivew for each day in Week
data1.Weekday.value_counts()
Out[235]:
Tuesday      561
Monday       510
Thursday     481
Wednesday    458
Sunday       427
Saturday     289
Friday       274
Name: Weekday, dtype: int64

Weekday Distribution

In [236]:
#Weekday distribution plot on Weekdays
data1.Weekday.value_counts().plot(kind='pie', autopct='%1.0f%%')
Out[236]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cd8c8d0>

Insight on Weekday distribution

  1. More people wrote review on Tuesday and Monday.
  2. Very less people wrote review on Friday
In [237]:
#To view count of review if it is on weekend or not
data1.is_weekend.value_counts()
Out[237]:
0    2284
1     716
Name: is_weekend, dtype: int64
In [238]:
#Plot to view count of review on weekend
data1.is_weekend.value_counts().plot(kind='pie', autopct='%1.0f%%')
Out[238]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cfb6c50>

Insight on Weekend distribution

  1. Most of the audience wrote review on Weekday. They might have watched movie on weekend, wrote review next day.
  2. 24% of people wrote review on Weekend. Lets see how was the review sentiment on Different days
In [240]:
#To view count of Positive review on weekend
data1[data1.Sentiment=='Pos'].is_weekend.value_counts(normalize=True)
Out[240]:
0    0.795333
1    0.204667
Name: is_weekend, dtype: float64
In [241]:
#Plot to view count of Positive review on weekend
data1[data1.Sentiment=='Pos'].is_weekend.value_counts().plot(kind='pie', autopct='%1.0f%%')
Out[241]:
<matplotlib.axes._subplots.AxesSubplot at 0x20f16ef0>

Negative review analysis

In [242]:
#To view count of Negative review on weekend
data1[data1.Sentiment=='Neg'].is_weekend.value_counts(normalize=True)
Out[242]:
0    0.727333
1    0.272667
Name: is_weekend, dtype: float64
In [243]:
#Plot to view count of Negative review on weekend
data1[data1.Sentiment=='Neg'].is_weekend.value_counts().plot(kind='pie', autopct='%1.0f%%')
Out[243]:
<matplotlib.axes._subplots.AxesSubplot at 0x2107bc18>
In [245]:
#To view count of Positive review on each day of the week
data1[data1.Sentiment=='Neg'].Weekday.value_counts(normalize=True)
Out[245]:
Monday       0.220000
Sunday       0.178000
Tuesday      0.156667
Wednesday    0.138000
Thursday     0.124000
Saturday     0.094667
Friday       0.088667
Name: Weekday, dtype: float64
In [246]:
#Plot to view count of Positive review on each day of the week
data1[data1.Sentiment=='Neg'].Weekday.value_counts().plot(kind='pie', autopct='%1.0f%%')
Out[246]:
<matplotlib.axes._subplots.AxesSubplot at 0x2114c320>

Insight on Negative reviews

  1. Though more Audience wrote review on Tuesday, Negative reviews are higher on Monday. May be because of Monday Blue :(
  2. More Negative reviews on Sunday and Monday.
  3. More Negative reviews on Weekend compared to overall weekday
  4. Very less negative review on Friday and Saturday. May be because Audience are in Party mood ! :)

3. Review Text feature engineering/Text processing

In [22]:
#To view data and understand it. 
data1.head(3)
Out[22]:
ReviewID Reviewer Name Review Rating Date_of_Review Sentiment
0 9bd27314-fc78-41fe-ba69-42669bc763d4 kascam Amazing cinematography! I don't know how they... 5.0 2019-08-22T23:52:03.870Z Pos
1 966121979 Stephen M I loved this movie \n 5.0 2019-08-22T23:50:08.574Z Pos
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A William It was pretty awesome...I was floored by the s... 4.0 2019-08-22T23:32:17.995Z Pos
In [9]:
#Dropping few of the columns which are unneccessary for analysis
data2= data1.drop(['Reviewer Name', 'Rating', 'Date_of_Review'], axis=1)
In [10]:
#To check the data after dropping few columns
data2.head(3)
Out[10]:
ReviewID Review Sentiment
0 9bd27314-fc78-41fe-ba69-42669bc763d4 Amazing cinematography! I don't know how they... Pos
1 966121979 I loved this movie \n Pos
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A It was pretty awesome...I was floored by the s... Pos
In [25]:
#To check shape of the data
data2.shape
Out[25]:
(3000, 3)
In [11]:
#To take backup of the modified data to use it in future. 
data2.to_csv("data2.csv", sep=',', columns=['ReviewID', 'Review','Sentiment'], header=True, index=False)
In [27]:
#Exporting review text to Text file - just as a backup file
data2['Review'].to_csv("review.txt", sep=',', columns=['Review',], header=True, index=False)

Text Cleaning

Audience review has mix of special characters, numbers and bad symbols.

It needs to be cleaned before we can use review text for classification models.

In [2]:
#To read file from backup.
data2 = pd.read_csv("data2.csv")
In [3]:
#Text cleaning code 
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]') #Removing special symbols
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]') #Removing bad symabols
STOPWORDS = set(stopwords.words('english')) #Removing stop words

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text. substitute the matched string in REPLACE_BY_SPACE_RE with space.
    text = BAD_SYMBOLS_RE.sub('', text) # remove symbols which are in BAD_SYMBOLS_RE from text. substitute the matched string in BAD_SYMBOLS_RE with nothing. 
#   text = re.sub(r'\W+', '', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwors from text
    return text
data2['Review'] = data2['Review'].apply(clean_text)
In [29]:
#To take back up of clean review text - just to view
data2['Review'].to_csv("cleanreview.txt", sep=',', columns=['Review',], header=True, index=False)

4. Building Base model - LSTM to understand Model Performance

In [94]:
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 2000
# Max number of words in each complaint.
MAX_SEQUENCE_LENGTH = 500
# This is fixed.
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(data2['Review'])
word_index = tokenizer.word_index

print('Found %s unique tokens.' % len(word_index))
Found 5585 unique tokens.
In [96]:
#Tokenizing text data and Padding sequence
X = tokenizer.texts_to_sequences(data2['Review'])
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Shape of data tensor: (3000, 500)
In [98]:
Y = pd.get_dummies(data2['Sentiment'])
print('Shape of label tensor:', Y.shape)
Shape of label tensor: (3000, 2)

Train Test split for Model

In [99]:
#Train-Test split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20, random_state = 432)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
(2400, 500) (2400, 2)
(600, 500) (600, 2)

LSTM model building with 40% dropout,

In [37]:
#Building Model
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(LSTM(100, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
W0824 10:10:47.523093  4392 deprecation_wrapper.py:119] From C:\Users\Lokesh\Anaconda3\lib\site-packages\keras\optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0824 10:10:47.582097  4392 deprecation_wrapper.py:119] From C:\Users\Lokesh\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3376: The name tf.log is deprecated. Please use tf.math.log instead.

W0824 10:10:47.599098  4392 deprecation.py:323] From C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorflow\python\ops\nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 500, 100)          15000000  
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 202       
=================================================================
Total params: 15,080,602
Trainable params: 15,080,602
Non-trainable params: 0
_________________________________________________________________
None
In [38]:
#Fitting model

%time
epochs = 3
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1)
Wall time: 0 ns
Train on 2160 samples, validate on 240 samples
Epoch 1/3
2160/2160 [==============================] - 117s 54ms/step - loss: 0.6580 - acc: 0.6495 - val_loss: 0.5986 - val_acc: 0.6958
Epoch 2/3
2160/2160 [==============================] - 119s 55ms/step - loss: 0.5044 - acc: 0.7648 - val_loss: 0.4816 - val_acc: 0.7750
Epoch 3/3
2160/2160 [==============================] - 125s 58ms/step - loss: 0.3385 - acc: 0.8616 - val_loss: 0.4698 - val_acc: 0.7708
In [39]:
#Model Evaluation 
accr = model.evaluate(X_test,Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0],accr[1]))
600/600 [==============================] - 7s 11ms/step
Test set
  Loss: 0.473
  Accuracy: 0.793

LSTM Model gave accuracy of 79.3% in first run

Loss plot for LSTM model

In [53]:
plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show();

Accuracy plot for LSTM model

In [54]:
plt.title('Accuracy')
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='test')
plt.legend()
plt.show();

Testing model performance on sample Unseen data

In [57]:
new_Review = ['Awesome movie I love how close it was to the original film absolutely amazing ❤❤❤']
seq = tokenizer.texts_to_sequences(new_Review)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)
labels = ['1','0']
print(pred, labels[np.argmax(pred)])
[[0.00907096 0.99092907]] 0
In [58]:
new_Review = ['Its the same move years ago. This time boring. Even my grandsons didnt like it.']
seq = tokenizer.texts_to_sequences(new_Review)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)
labels = ['1','0']
print(pred, labels[np.argmax(pred)])
[[0.9355758  0.06442422]] 1

LSTM model is predicting correctly to 2 unseen data.

will predict on unseen test data to check performance

Loading Unseen Test data

In [108]:
# Loading Unseen Test data
Unseen_test = pd.read_csv("test-1566619745327.csv")
In [109]:
Unseen_test.head(3)
Out[109]:
ReviewID review
0 92876 Was good. Nothing like the original but I beli...
1 92877 I absolutely loved it! A wonderful rendition o...
2 92878 I love the movie! Good job director! \nI appre...
In [116]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]') #Removing special symbols
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]') #Removing bad symabols
STOPWORDS = set(stopwords.words('english')) #Removing stop words

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text. substitute the matched string in REPLACE_BY_SPACE_RE with space.
    text = BAD_SYMBOLS_RE.sub('', text) # remove symbols which are in BAD_SYMBOLS_RE from text. substitute the matched string in BAD_SYMBOLS_RE with nothing. 
#   text = re.sub(r'\W+', '', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwors from text
    return text
Unseen_test['review'] = Unseen_test['review'].apply(clean_text)
In [250]:
#Tokenizing text data and Padding sequence
X = tokenizer.texts_to_sequences(Unseen_test['review'])
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Shape of data tensor: (1200, 500)
In [64]:
seq = tokenizer.texts_to_sequences(Unseen_test['review'])
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred_unseen = model.predict(padded)
labels = ['0','1']
lstm_output = labels[np.argmax(pred_unseen)]
In [65]:
print(pred_unseen)
[[0.8021249  0.19787511]
 [0.4142864  0.58571357]
 [0.18250997 0.81749004]
 ...
 [0.4568032  0.5431968 ]
 [0.3344999  0.66550016]
 [0.97317684 0.02682317]]
In [73]:
pred_unseen[0]
Out[73]:
array([0.8021249 , 0.19787511], dtype=float32)
In [75]:
pred_unseen[1199]
Out[75]:
array([0.97317684, 0.02682317], dtype=float32)
In [82]:
lstm_output = []
In [87]:
#Getting max of Levels
labels = ['0','1']
i=0
for i in range(0,1199):
    lstm_output.append(labels[np.argmax(pred_unseen[i])])
In [92]:
lstm_OUT = pd.DataFrame(lstm_output)
In [93]:
#Exporting LSTM model output to CSV file. 
lstm_OUT.to_csv("lstm_output.csv")

LSTM model output gave 0.34 Metric score at first trial run [on only with basic cleaned text]

Overfitted Model

5. Review Text Understanding

Lets understand some insights from Text/Audience review

In [94]:
#To create word count feature with taking length of text split words
data2['word_count'] = [len(text.split(' ')) for text in data2['Review']]
In [95]:
data2.head(3)
Out[95]:
ReviewID Review Sentiment word_count
0 9bd27314-fc78-41fe-ba69-42669bc763d4 amazing cinematography dont know wonderful Pos 5
1 966121979 loved movie Pos 2
2 75D441F3-4AE6-4447-9702-8EDD3BA4153A pretty awesomei floored story cgi Pos 5
In [96]:
## Getting the first quartile value
q1 = np.percentile(data2.word_count,25)
print(f"The first quartile value of words_count attribute is {q1}")
The first quartile value of words_count attribute is 5.0
In [130]:
## Getting the second quartile value
q2 = np.percentile(data2.word_count,50)
print(f"The first quartile value of words_count attribute is {q2}")
The first quartile value of words_count attribute is 9.0
In [131]:
## Getting the Third quartile value
q3 = np.percentile(data2.word_count,75)
print(f"The first quartile value of words_count attribute is {q3}")
The first quartile value of words_count attribute is 17.0
In [132]:
## Getting the 90% value
q90 = np.percentile(data2.word_count,90)
print(f"The first quartile value of words_count attribute is {q90}")
The first quartile value of words_count attribute is 34.0
In [70]:
labels = ['q1', 'q2', 'q3', 'q90']
sizes = [5,9,17,34]
patches= plt.bar(x=labels, height=sizes, width=0.2)
plt.legend(patches, labels, loc="best")
plt.tight_layout()
plt.show()
Out[70]:
<matplotlib.legend.Legend at 0x1d40b358>

Insight

  1. 25% of reviews have only 5 words and less.
  2. 50% of the reviews have 9 words and less.
  3. 75% of the reviews have 17 words and less.
  4. only 15% of reviews are more than 30 words

6. Review Text feature engineering/Text processing - Continued

Loading Spacy library

In [4]:
# Loading Spacy library
import spacy
nlp = spacy.load("en_core_web_sm")
In [5]:
## load spacy's English stopwords as variable called 'stopwords'

stopwords = spacy.lang.en.stop_words.STOP_WORDS
print('Number of stop words: %d' % len(stopwords))
print('First ten stop words: %s' % list(stopwords)[:10])
Number of stop words: 326
First ten stop words: ['’ll', 'ever', 'becoming', 'beside', 'in', 'whither', 'me', 'used', 'whence', 'toward']
In [6]:
## load nltk's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
In [7]:
# Here I define a tokenizer and stemmer which returns the set of stems (excluding stop words) in the text that it is passed

def tokenize_and_stem(doc, remove_stopwords = True):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    if remove_stopwords:
        tokens = [word.text for word in doc if not word.is_stop]
    else:
        tokens = [word.text for word in doc]
        
    #print(tokens[:5])
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    
    #print("ended re.search")
    stems = [stemmer.stem(t) for t in filtered_tokens]
    #print("returning stems")
    return stems

def tokenize_and_lemmatize(doc, remove_stopwords = True):
    
    # spaCy will convert word to lower case and changing past tense, 
    # gerund form (other tenses as well) to present tense. Also, “they” normalize to “-PRON-” which is pronoun.

    if remove_stopwords:
        tokens = [word for word in doc if not word.is_stop]
    else:
        tokens = [word for word in doc]
    #print("Completed tokenization")
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token.text):
            filtered_tokens.append(token)
            
    #print("ended re.search")
    lemma = [t.lemma_ for t in filtered_tokens]
    #print("returning stems")
    return lemma


def tokenize_only(doc, remove_stopwords = True):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    if remove_stopwords:
        tokens = [word.text for word in doc if not word.is_stop]
    else:
        tokens = [word.text for word in doc]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
In [8]:
#Converting to dictory 
data2 = data2.reset_index().to_dict(orient='list')
In [9]:
## We are trying to create four seperate lists for text with stop words, text without stop words,
## text with stemmed words and text with lemmatized words.

## Naming Conventions followed ####

## 'clean' word is appended to lists which do not contain stopwords

## 'all' keyword is appended to lists which contain stopwords.

## use extend so it's a big flat list of vocab

data2['clean_text_stemmed'] = []
data2['clean_text_lemmatized'] = []
data2['text_stemmed'] = []
data2['text_lemmatized'] = []

vocab_stemmed = []

vocab_tokenized = []
allvocab_tokenized = []

vocab_lemmatized = []
allvocab_lemmatized = []


for idx,text in enumerate(data2['Review']):

## first convert the entire text into spacy document type
#     print(f"The type of text is {type(text)} and text is {text}")
#     print(f"The type of idx is {type(idx)} and idx is {idx}")
    doc = nlp(text)
    print(f"processing {idx} document")
    words_stemmed = tokenize_and_stem(doc)
    words_lemmatized = tokenize_and_lemmatize(doc)
    vocab_stemmed.extend(words_stemmed)
    vocab_lemmatized.extend(words_lemmatized)
    
    data2['clean_text_stemmed'].append(words_stemmed)
    data2['clean_text_lemmatized'].append(words_lemmatized)
    
    allwords_stemmed = tokenize_and_stem(doc, False) 
    allwords_lemmatized = tokenize_and_lemmatize(doc, False)
    allvocab_lemmatized.extend(allwords_lemmatized)
    
    data2['text_stemmed'].append(allwords_stemmed)
    data2['text_lemmatized'].append(allwords_lemmatized)
    
    allwords_tokenized = tokenize_only(doc,False)
    allvocab_tokenized.extend(allwords_tokenized)
    
    words_tokenized = tokenize_only(doc)
    vocab_tokenized.extend(words_tokenized)
processing 0 document
processing 1 document
processing 2 document
processing 3 document
processing 4 document
processing 5 document
processing 6 document
processing 7 document
processing 8 document
processing 9 document
processing 10 document
processing 11 document
processing 12 document
processing 13 document
processing 14 document
processing 15 document
processing 16 document
processing 17 document
processing 18 document
processing 19 document
processing 20 document
processing 21 document
processing 22 document
processing 23 document
processing 24 document
processing 25 document
processing 26 document
processing 27 document
processing 28 document
processing 29 document
processing 30 document
processing 31 document
processing 32 document
processing 33 document
processing 34 document
processing 35 document
processing 36 document
processing 37 document
processing 38 document
processing 39 document
processing 40 document
processing 41 document
processing 42 document
processing 43 document
processing 44 document
processing 45 document
processing 46 document
processing 47 document
processing 48 document
processing 49 document
processing 50 document
processing 51 document
processing 52 document
processing 53 document
processing 54 document
processing 55 document
processing 56 document
processing 57 document
processing 58 document
processing 59 document
processing 60 document
processing 61 document
processing 62 document
processing 63 document
processing 64 document
processing 65 document
processing 66 document
processing 67 document
processing 68 document
processing 69 document
processing 70 document
processing 71 document
processing 72 document
processing 73 document
processing 74 document
processing 75 document
processing 76 document
processing 77 document
processing 78 document
processing 79 document
processing 80 document
processing 81 document
processing 82 document
processing 83 document
processing 84 document
processing 85 document
processing 86 document
processing 87 document
processing 88 document
processing 89 document
processing 90 document
processing 91 document
processing 92 document
processing 93 document
processing 94 document
processing 95 document
processing 96 document
processing 97 document
processing 98 document
processing 99 document
processing 100 document
processing 101 document
processing 102 document
processing 103 document
processing 104 document
processing 105 document
processing 106 document
processing 107 document
processing 108 document
processing 109 document
processing 110 document
processing 111 document
processing 112 document
processing 113 document
processing 114 document
processing 115 document
processing 116 document
processing 117 document
processing 118 document
processing 119 document
processing 120 document
processing 121 document
processing 122 document
processing 123 document
processing 124 document
processing 125 document
processing 126 document
processing 127 document
processing 128 document
processing 129 document
processing 130 document
processing 131 document
processing 132 document
processing 133 document
processing 134 document
processing 135 document
processing 136 document
processing 137 document
processing 138 document
processing 139 document
processing 140 document
processing 141 document
processing 142 document
processing 143 document
processing 144 document
processing 145 document
processing 146 document
processing 147 document
processing 148 document
processing 149 document
processing 150 document
processing 151 document
processing 152 document
processing 153 document
processing 154 document
processing 155 document
processing 156 document
processing 157 document
processing 158 document
processing 159 document
processing 160 document
processing 161 document
processing 162 document
processing 163 document
processing 164 document
processing 165 document
processing 166 document
processing 167 document
processing 168 document
processing 169 document
processing 170 document
processing 171 document
processing 172 document
processing 173 document
processing 174 document
processing 175 document
processing 176 document
processing 177 document
processing 178 document
processing 179 document
processing 180 document
processing 181 document
processing 182 document
processing 183 document
processing 184 document
processing 185 document
processing 186 document
processing 187 document
processing 188 document
processing 189 document
processing 190 document
processing 191 document
processing 192 document
processing 193 document
processing 194 document
processing 195 document
processing 196 document
processing 197 document
processing 198 document
processing 199 document
processing 200 document
processing 201 document
processing 202 document
processing 203 document
processing 204 document
processing 205 document
processing 206 document
processing 207 document
processing 208 document
processing 209 document
processing 210 document
processing 211 document
processing 212 document
processing 213 document
processing 214 document
processing 215 document
processing 216 document
processing 217 document
processing 218 document
processing 219 document
processing 220 document
processing 221 document
processing 222 document
processing 223 document
processing 224 document
processing 225 document
processing 226 document
processing 227 document
processing 228 document
processing 229 document
processing 230 document
processing 231 document
processing 232 document
processing 233 document
processing 234 document
processing 235 document
processing 236 document
processing 237 document
processing 238 document
processing 239 document
processing 240 document
processing 241 document
processing 242 document
processing 243 document
processing 244 document
processing 245 document
processing 246 document
processing 247 document
processing 248 document
processing 249 document
processing 250 document
processing 251 document
processing 252 document
processing 253 document
processing 254 document
processing 255 document
processing 256 document
processing 257 document
processing 258 document
processing 259 document
processing 260 document
processing 261 document
processing 262 document
processing 263 document
processing 264 document
processing 265 document
processing 266 document
processing 267 document
processing 268 document
processing 269 document
processing 270 document
processing 271 document
processing 272 document
processing 273 document
processing 274 document
processing 275 document
processing 276 document
processing 277 document
processing 278 document
processing 279 document
processing 280 document
processing 281 document
processing 282 document
processing 283 document
processing 284 document
processing 285 document
processing 286 document
processing 287 document
processing 288 document
processing 289 document
processing 290 document
processing 291 document
processing 292 document
processing 293 document
processing 294 document
processing 295 document
processing 296 document
processing 297 document
processing 298 document
processing 299 document
processing 300 document
processing 301 document
processing 302 document
processing 303 document
processing 304 document
processing 305 document
processing 306 document
processing 307 document
processing 308 document
processing 309 document
processing 310 document
processing 311 document
processing 312 document
processing 313 document
processing 314 document
processing 315 document
processing 316 document
processing 317 document
processing 318 document
processing 319 document
processing 320 document
processing 321 document
processing 322 document
processing 323 document
processing 324 document
processing 325 document
processing 326 document
processing 327 document
processing 328 document
processing 329 document
processing 330 document
processing 331 document
processing 332 document
processing 333 document
processing 334 document
processing 335 document
processing 336 document
processing 337 document
processing 338 document
processing 339 document
processing 340 document
processing 341 document
processing 342 document
processing 343 document
processing 344 document
processing 345 document
processing 346 document
processing 347 document
processing 348 document
processing 349 document
processing 350 document
processing 351 document
processing 352 document
processing 353 document
processing 354 document
processing 355 document
processing 356 document
processing 357 document
processing 358 document
processing 359 document
processing 360 document
processing 361 document
processing 362 document
processing 363 document
processing 364 document
processing 365 document
processing 366 document
processing 367 document
processing 368 document
processing 369 document
processing 370 document
processing 371 document
processing 372 document
processing 373 document
processing 374 document
processing 375 document
processing 376 document
processing 377 document
processing 378 document
processing 379 document
processing 380 document
processing 381 document
processing 382 document
processing 383 document
processing 384 document
processing 385 document
processing 386 document
processing 387 document
processing 388 document
processing 389 document
processing 390 document
processing 391 document
processing 392 document
processing 393 document
processing 394 document
processing 395 document
processing 396 document
processing 397 document
processing 398 document
processing 399 document
processing 400 document
processing 401 document
processing 402 document
processing 403 document
processing 404 document
processing 405 document
processing 406 document
processing 407 document
processing 408 document
processing 409 document
processing 410 document
processing 411 document
processing 412 document
processing 413 document
processing 414 document
processing 415 document
processing 416 document
processing 417 document
processing 418 document
processing 419 document
processing 420 document
processing 421 document
processing 422 document
processing 423 document
processing 424 document
processing 425 document
processing 426 document
processing 427 document
processing 428 document
processing 429 document
processing 430 document
processing 431 document
processing 432 document
processing 433 document
processing 434 document
processing 435 document
processing 436 document
processing 437 document
processing 438 document
processing 439 document
processing 440 document
processing 441 document
processing 442 document
processing 443 document
processing 444 document
processing 445 document
processing 446 document
processing 447 document
processing 448 document
processing 449 document
processing 450 document
processing 451 document
processing 452 document
processing 453 document
processing 454 document
processing 455 document
processing 456 document
processing 457 document
processing 458 document
processing 459 document
processing 460 document
processing 461 document
processing 462 document
processing 463 document
processing 464 document
processing 465 document
processing 466 document
processing 467 document
processing 468 document
processing 469 document
processing 470 document
processing 471 document
processing 472 document
processing 473 document
processing 474 document
processing 475 document
processing 476 document
processing 477 document
processing 478 document
processing 479 document
processing 480 document
processing 481 document
processing 482 document
processing 483 document
processing 484 document
processing 485 document
processing 486 document
processing 487 document
processing 488 document
processing 489 document
processing 490 document
processing 491 document
processing 492 document
processing 493 document
processing 494 document
processing 495 document
processing 496 document
processing 497 document
processing 498 document
processing 499 document
processing 500 document
processing 501 document
processing 502 document
processing 503 document
processing 504 document
processing 505 document
processing 506 document
processing 507 document
processing 508 document
processing 509 document
processing 510 document
processing 511 document
processing 512 document
processing 513 document
processing 514 document
processing 515 document
processing 516 document
processing 517 document
processing 518 document
processing 519 document
processing 520 document
processing 521 document
processing 522 document
processing 523 document
processing 524 document
processing 525 document
processing 526 document
processing 527 document
processing 528 document
processing 529 document
processing 530 document
processing 531 document
processing 532 document
processing 533 document
processing 534 document
processing 535 document
processing 536 document
processing 537 document
processing 538 document
processing 539 document
processing 540 document
processing 541 document
processing 542 document
processing 543 document
processing 544 document
processing 545 document
processing 546 document
processing 547 document
processing 548 document
processing 549 document
processing 550 document
processing 551 document
processing 552 document
processing 553 document
processing 554 document
processing 555 document
processing 556 document
processing 557 document
processing 558 document
processing 559 document
processing 560 document
processing 561 document
processing 562 document
processing 563 document
processing 564 document
processing 565 document
processing 566 document
processing 567 document
processing 568 document
processing 569 document
processing 570 document
processing 571 document
processing 572 document
processing 573 document
processing 574 document
processing 575 document
processing 576 document
processing 577 document
processing 578 document
processing 579 document
processing 580 document
processing 581 document
processing 582 document
processing 583 document
processing 584 document
processing 585 document
processing 586 document
processing 587 document
processing 588 document
processing 589 document
processing 590 document
processing 591 document
processing 592 document
processing 593 document
processing 594 document
processing 595 document
processing 596 document
processing 597 document
processing 598 document
processing 599 document
processing 600 document
processing 601 document
processing 602 document
processing 603 document
processing 604 document
processing 605 document
processing 606 document
processing 607 document
processing 608 document
processing 609 document
processing 610 document
processing 611 document
processing 612 document
processing 613 document
processing 614 document
processing 615 document
processing 616 document
processing 617 document
processing 618 document
processing 619 document
processing 620 document
processing 621 document
processing 622 document
processing 623 document
processing 624 document
processing 625 document
processing 626 document
processing 627 document
processing 628 document
processing 629 document
processing 630 document
processing 631 document
processing 632 document
processing 633 document
processing 634 document
processing 635 document
processing 636 document
processing 637 document
processing 638 document
processing 639 document
processing 640 document
processing 641 document
processing 642 document
processing 643 document
processing 644 document
processing 645 document
processing 646 document
processing 647 document
processing 648 document
processing 649 document
processing 650 document
processing 651 document
processing 652 document
processing 653 document
processing 654 document
processing 655 document
processing 656 document
processing 657 document
processing 658 document
processing 659 document
processing 660 document
processing 661 document
processing 662 document
processing 663 document
processing 664 document
processing 665 document
processing 666 document
processing 667 document
processing 668 document
processing 669 document
processing 670 document
processing 671 document
processing 672 document
processing 673 document
processing 674 document
processing 675 document
processing 676 document
processing 677 document
processing 678 document
processing 679 document
processing 680 document
processing 681 document
processing 682 document
processing 683 document
processing 684 document
processing 685 document
processing 686 document
processing 687 document
processing 688 document
processing 689 document
processing 690 document
processing 691 document
processing 692 document
processing 693 document
processing 694 document
processing 695 document
processing 696 document
processing 697 document
processing 698 document
processing 699 document
processing 700 document
processing 701 document
processing 702 document
processing 703 document
processing 704 document
processing 705 document
processing 706 document
processing 707 document
processing 708 document
processing 709 document
processing 710 document
processing 711 document
processing 712 document
processing 713 document
processing 714 document
processing 715 document
processing 716 document
processing 717 document
processing 718 document
processing 719 document
processing 720 document
processing 721 document
processing 722 document
processing 723 document
processing 724 document
processing 725 document
processing 726 document
processing 727 document
processing 728 document
processing 729 document
processing 730 document
processing 731 document
processing 732 document
processing 733 document
processing 734 document
processing 735 document
processing 736 document
processing 737 document
processing 738 document
processing 739 document
processing 740 document
processing 741 document
processing 742 document
processing 743 document
processing 744 document
processing 745 document
processing 746 document
processing 747 document
processing 748 document
processing 749 document
processing 750 document
processing 751 document
processing 752 document
processing 753 document
processing 754 document
processing 755 document
processing 756 document
processing 757 document
processing 758 document
processing 759 document
processing 760 document
processing 761 document
processing 762 document
processing 763 document
processing 764 document
processing 765 document
processing 766 document
processing 767 document
processing 768 document
processing 769 document
processing 770 document
processing 771 document
processing 772 document
processing 773 document
processing 774 document
processing 775 document
processing 776 document
processing 777 document
processing 778 document
processing 779 document
processing 780 document
processing 781 document
processing 782 document
processing 783 document
processing 784 document
processing 785 document
processing 786 document
processing 787 document
processing 788 document
processing 789 document
processing 790 document
processing 791 document
processing 792 document
processing 793 document
processing 794 document
processing 795 document
processing 796 document
processing 797 document
processing 798 document
processing 799 document
processing 800 document
processing 801 document
processing 802 document
processing 803 document
processing 804 document
processing 805 document
processing 806 document
processing 807 document
processing 808 document
processing 809 document
processing 810 document
processing 811 document
processing 812 document
processing 813 document
processing 814 document
processing 815 document
processing 816 document
processing 817 document
processing 818 document
processing 819 document
processing 820 document
processing 821 document
processing 822 document
processing 823 document
processing 824 document
processing 825 document
processing 826 document
processing 827 document
processing 828 document
processing 829 document
processing 830 document
processing 831 document
processing 832 document
processing 833 document
processing 834 document
processing 835 document
processing 836 document
processing 837 document
processing 838 document
processing 839 document
processing 840 document
processing 841 document
processing 842 document
processing 843 document
processing 844 document
processing 845 document
processing 846 document
processing 847 document
processing 848 document
processing 849 document
processing 850 document
processing 851 document
processing 852 document
processing 853 document
processing 854 document
processing 855 document
processing 856 document
processing 857 document
processing 858 document
processing 859 document
processing 860 document
processing 861 document
processing 862 document
processing 863 document
processing 864 document
processing 865 document
processing 866 document
processing 867 document
processing 868 document
processing 869 document
processing 870 document
processing 871 document
processing 872 document
processing 873 document
processing 874 document
processing 875 document
processing 876 document
processing 877 document
processing 878 document
processing 879 document
processing 880 document
processing 881 document
processing 882 document
processing 883 document
processing 884 document
processing 885 document
processing 886 document
processing 887 document
processing 888 document
processing 889 document
processing 890 document
processing 891 document
processing 892 document
processing 893 document
processing 894 document
processing 895 document
processing 896 document
processing 897 document
processing 898 document
processing 899 document
processing 900 document
processing 901 document
processing 902 document
processing 903 document
processing 904 document
processing 905 document
processing 906 document
processing 907 document
processing 908 document
processing 909 document
processing 910 document
processing 911 document
processing 912 document
processing 913 document
processing 914 document
processing 915 document
processing 916 document
processing 917 document
processing 918 document
processing 919 document
processing 920 document
processing 921 document
processing 922 document
processing 923 document
processing 924 document
processing 925 document
processing 926 document
processing 927 document
processing 928 document
processing 929 document
processing 930 document
processing 931 document
processing 932 document
processing 933 document
processing 934 document
processing 935 document
processing 936 document
processing 937 document
processing 938 document
processing 939 document
processing 940 document
processing 941 document
processing 942 document
processing 943 document
processing 944 document
processing 945 document
processing 946 document
processing 947 document
processing 948 document
processing 949 document
processing 950 document
processing 951 document
processing 952 document
processing 953 document
processing 954 document
processing 955 document
processing 956 document
processing 957 document
processing 958 document
processing 959 document
processing 960 document
processing 961 document
processing 962 document
processing 963 document
processing 964 document
processing 965 document
processing 966 document
processing 967 document
processing 968 document
processing 969 document
processing 970 document
processing 971 document
processing 972 document
processing 973 document
processing 974 document
processing 975 document
processing 976 document
processing 977 document
processing 978 document
processing 979 document
processing 980 document
processing 981 document
processing 982 document
processing 983 document
processing 984 document
processing 985 document
processing 986 document
processing 987 document
processing 988 document
processing 989 document
processing 990 document
processing 991 document
processing 992 document
processing 993 document
processing 994 document
processing 995 document
processing 996 document
processing 997 document
processing 998 document
processing 999 document
processing 1000 document
processing 1001 document
processing 1002 document
processing 1003 document
processing 1004 document
processing 1005 document
processing 1006 document
processing 1007 document
processing 1008 document
processing 1009 document
processing 1010 document
processing 1011 document
processing 1012 document
processing 1013 document
processing 1014 document
processing 1015 document
processing 1016 document
processing 1017 document
processing 1018 document
processing 1019 document
processing 1020 document
processing 1021 document
processing 1022 document
processing 1023 document
processing 1024 document
processing 1025 document
processing 1026 document
processing 1027 document
processing 1028 document
processing 1029 document
processing 1030 document
processing 1031 document
processing 1032 document
processing 1033 document
processing 1034 document
processing 1035 document
processing 1036 document
processing 1037 document
processing 1038 document
processing 1039 document
processing 1040 document
processing 1041 document
processing 1042 document
processing 1043 document
processing 1044 document
processing 1045 document
processing 1046 document
processing 1047 document
processing 1048 document
processing 1049 document
processing 1050 document
processing 1051 document
processing 1052 document
processing 1053 document
processing 1054 document
processing 1055 document
processing 1056 document
processing 1057 document
processing 1058 document
processing 1059 document
processing 1060 document
processing 1061 document
processing 1062 document
processing 1063 document
processing 1064 document
processing 1065 document
processing 1066 document
processing 1067 document
processing 1068 document
processing 1069 document
processing 1070 document
processing 1071 document
processing 1072 document
processing 1073 document
processing 1074 document
processing 1075 document
processing 1076 document
processing 1077 document
processing 1078 document
processing 1079 document
processing 1080 document
processing 1081 document
processing 1082 document
processing 1083 document
processing 1084 document
processing 1085 document
processing 1086 document
processing 1087 document
processing 1088 document
processing 1089 document
processing 1090 document
processing 1091 document
processing 1092 document
processing 1093 document
processing 1094 document
processing 1095 document
processing 1096 document
processing 1097 document
processing 1098 document
processing 1099 document
processing 1100 document
processing 1101 document
processing 1102 document
processing 1103 document
processing 1104 document
processing 1105 document
processing 1106 document
processing 1107 document
processing 1108 document
processing 1109 document
processing 1110 document
processing 1111 document
processing 1112 document
processing 1113 document
processing 1114 document
processing 1115 document
processing 1116 document
processing 1117 document
processing 1118 document
processing 1119 document
processing 1120 document
processing 1121 document
processing 1122 document
processing 1123 document
processing 1124 document
processing 1125 document
processing 1126 document
processing 1127 document
processing 1128 document
processing 1129 document
processing 1130 document
processing 1131 document
processing 1132 document
processing 1133 document
processing 1134 document
processing 1135 document
processing 1136 document
processing 1137 document
processing 1138 document
processing 1139 document
processing 1140 document
processing 1141 document
processing 1142 document
processing 1143 document
processing 1144 document
processing 1145 document
processing 1146 document
processing 1147 document
processing 1148 document
processing 1149 document
processing 1150 document
processing 1151 document
processing 1152 document
processing 1153 document
processing 1154 document
processing 1155 document
processing 1156 document
processing 1157 document
processing 1158 document
processing 1159 document
processing 1160 document
processing 1161 document
processing 1162 document
processing 1163 document
processing 1164 document
processing 1165 document
processing 1166 document
processing 1167 document
processing 1168 document
processing 1169 document
processing 1170 document
processing 1171 document
processing 1172 document
processing 1173 document
processing 1174 document
processing 1175 document
processing 1176 document
processing 1177 document
processing 1178 document
processing 1179 document
processing 1180 document
processing 1181 document
processing 1182 document
processing 1183 document
processing 1184 document
processing 1185 document
processing 1186 document
processing 1187 document
processing 1188 document
processing 1189 document
processing 1190 document
processing 1191 document
processing 1192 document
processing 1193 document
processing 1194 document
processing 1195 document
processing 1196 document
processing 1197 document
processing 1198 document
processing 1199 document
processing 1200 document
processing 1201 document
processing 1202 document
processing 1203 document
processing 1204 document
processing 1205 document
processing 1206 document
processing 1207 document
processing 1208 document
processing 1209 document
processing 1210 document
processing 1211 document
processing 1212 document
processing 1213 document
processing 1214 document
processing 1215 document
processing 1216 document
processing 1217 document
processing 1218 document
processing 1219 document
processing 1220 document
processing 1221 document
processing 1222 document
processing 1223 document
processing 1224 document
processing 1225 document
processing 1226 document
processing 1227 document
processing 1228 document
processing 1229 document
processing 1230 document
processing 1231 document
processing 1232 document
processing 1233 document
processing 1234 document
processing 1235 document
processing 1236 document
processing 1237 document
processing 1238 document
processing 1239 document
processing 1240 document
processing 1241 document
processing 1242 document
processing 1243 document
processing 1244 document
processing 1245 document
processing 1246 document
processing 1247 document
processing 1248 document
processing 1249 document
processing 1250 document
processing 1251 document
processing 1252 document
processing 1253 document
processing 1254 document
processing 1255 document
processing 1256 document
processing 1257 document
processing 1258 document
processing 1259 document
processing 1260 document
processing 1261 document
processing 1262 document
processing 1263 document
processing 1264 document
processing 1265 document
processing 1266 document
processing 1267 document
processing 1268 document
processing 1269 document
processing 1270 document
processing 1271 document
processing 1272 document
processing 1273 document
processing 1274 document
processing 1275 document
processing 1276 document
processing 1277 document
processing 1278 document
processing 1279 document
processing 1280 document
processing 1281 document
processing 1282 document
processing 1283 document
processing 1284 document
processing 1285 document
processing 1286 document
processing 1287 document
processing 1288 document
processing 1289 document
processing 1290 document
processing 1291 document
processing 1292 document
processing 1293 document
processing 1294 document
processing 1295 document
processing 1296 document
processing 1297 document
processing 1298 document
processing 1299 document
processing 1300 document
processing 1301 document
processing 1302 document
processing 1303 document
processing 1304 document
processing 1305 document
processing 1306 document
processing 1307 document
processing 1308 document
processing 1309 document
processing 1310 document
processing 1311 document
processing 1312 document
processing 1313 document
processing 1314 document
processing 1315 document
processing 1316 document
processing 1317 document
processing 1318 document
processing 1319 document
processing 1320 document
processing 1321 document
processing 1322 document
processing 1323 document
processing 1324 document
processing 1325 document
processing 1326 document
processing 1327 document
processing 1328 document
processing 1329 document
processing 1330 document
processing 1331 document
processing 1332 document
processing 1333 document
processing 1334 document
processing 1335 document
processing 1336 document
processing 1337 document
processing 1338 document
processing 1339 document
processing 1340 document
processing 1341 document
processing 1342 document
processing 1343 document
processing 1344 document
processing 1345 document
processing 1346 document
processing 1347 document
processing 1348 document
processing 1349 document
processing 1350 document
processing 1351 document
processing 1352 document
processing 1353 document
processing 1354 document
processing 1355 document
processing 1356 document
processing 1357 document
processing 1358 document
processing 1359 document
processing 1360 document
processing 1361 document
processing 1362 document
processing 1363 document
processing 1364 document
processing 1365 document
processing 1366 document
processing 1367 document
processing 1368 document
processing 1369 document
processing 1370 document
processing 1371 document
processing 1372 document
processing 1373 document
processing 1374 document
processing 1375 document
processing 1376 document
processing 1377 document
processing 1378 document
processing 1379 document
processing 1380 document
processing 1381 document
processing 1382 document
processing 1383 document
processing 1384 document
processing 1385 document
processing 1386 document
processing 1387 document
processing 1388 document
processing 1389 document
processing 1390 document
processing 1391 document
processing 1392 document
processing 1393 document
processing 1394 document
processing 1395 document
processing 1396 document
processing 1397 document
processing 1398 document
processing 1399 document
processing 1400 document
processing 1401 document
processing 1402 document
processing 1403 document
processing 1404 document
processing 1405 document
processing 1406 document
processing 1407 document
processing 1408 document
processing 1409 document
processing 1410 document
processing 1411 document
processing 1412 document
processing 1413 document
processing 1414 document
processing 1415 document
processing 1416 document
processing 1417 document
processing 1418 document
processing 1419 document
processing 1420 document
processing 1421 document
processing 1422 document
processing 1423 document
processing 1424 document
processing 1425 document
processing 1426 document
processing 1427 document
processing 1428 document
processing 1429 document
processing 1430 document
processing 1431 document
processing 1432 document
processing 1433 document
processing 1434 document
processing 1435 document
processing 1436 document
processing 1437 document
processing 1438 document
processing 1439 document
processing 1440 document
processing 1441 document
processing 1442 document
processing 1443 document
processing 1444 document
processing 1445 document
processing 1446 document
processing 1447 document
processing 1448 document
processing 1449 document
processing 1450 document
processing 1451 document
processing 1452 document
processing 1453 document
processing 1454 document
processing 1455 document
processing 1456 document
processing 1457 document
processing 1458 document
processing 1459 document
processing 1460 document
processing 1461 document
processing 1462 document
processing 1463 document
processing 1464 document
processing 1465 document
processing 1466 document
processing 1467 document
processing 1468 document
processing 1469 document
processing 1470 document
processing 1471 document
processing 1472 document
processing 1473 document
processing 1474 document
processing 1475 document
processing 1476 document
processing 1477 document
processing 1478 document
processing 1479 document
processing 1480 document
processing 1481 document
processing 1482 document
processing 1483 document
processing 1484 document
processing 1485 document
processing 1486 document
processing 1487 document
processing 1488 document
processing 1489 document
processing 1490 document
processing 1491 document
processing 1492 document
processing 1493 document
processing 1494 document
processing 1495 document
processing 1496 document
processing 1497 document
processing 1498 document
processing 1499 document
processing 1500 document
processing 1501 document
processing 1502 document
processing 1503 document
processing 1504 document
processing 1505 document
processing 1506 document
processing 1507 document
processing 1508 document
processing 1509 document
processing 1510 document
processing 1511 document
processing 1512 document
processing 1513 document
processing 1514 document
processing 1515 document
processing 1516 document
processing 1517 document
processing 1518 document
processing 1519 document
processing 1520 document
processing 1521 document
processing 1522 document
processing 1523 document
processing 1524 document
processing 1525 document
processing 1526 document
processing 1527 document
processing 1528 document
processing 1529 document
processing 1530 document
processing 1531 document
processing 1532 document
processing 1533 document
processing 1534 document
processing 1535 document
processing 1536 document
processing 1537 document
processing 1538 document
processing 1539 document
processing 1540 document
processing 1541 document
processing 1542 document
processing 1543 document
processing 1544 document
processing 1545 document
processing 1546 document
processing 1547 document
processing 1548 document
processing 1549 document
processing 1550 document
processing 1551 document
processing 1552 document
processing 1553 document
processing 1554 document
processing 1555 document
processing 1556 document
processing 1557 document
processing 1558 document
processing 1559 document
processing 1560 document
processing 1561 document
processing 1562 document
processing 1563 document
processing 1564 document
processing 1565 document
processing 1566 document
processing 1567 document
processing 1568 document
processing 1569 document
processing 1570 document
processing 1571 document
processing 1572 document
processing 1573 document
processing 1574 document
processing 1575 document
processing 1576 document
processing 1577 document
processing 1578 document
processing 1579 document
processing 1580 document
processing 1581 document
processing 1582 document
processing 1583 document
processing 1584 document
processing 1585 document
processing 1586 document
processing 1587 document
processing 1588 document
processing 1589 document
processing 1590 document
processing 1591 document
processing 1592 document
processing 1593 document
processing 1594 document
processing 1595 document
processing 1596 document
processing 1597 document
processing 1598 document
processing 1599 document
processing 1600 document
processing 1601 document
processing 1602 document
processing 1603 document
processing 1604 document
processing 1605 document
processing 1606 document
processing 1607 document
processing 1608 document
processing 1609 document
processing 1610 document
processing 1611 document
processing 1612 document
processing 1613 document
processing 1614 document
processing 1615 document
processing 1616 document
processing 1617 document
processing 1618 document
processing 1619 document
processing 1620 document
processing 1621 document
processing 1622 document
processing 1623 document
processing 1624 document
processing 1625 document
processing 1626 document
processing 1627 document
processing 1628 document
processing 1629 document
processing 1630 document
processing 1631 document
processing 1632 document
processing 1633 document
processing 1634 document
processing 1635 document
processing 1636 document
processing 1637 document
processing 1638 document
processing 1639 document
processing 1640 document
processing 1641 document
processing 1642 document
processing 1643 document
processing 1644 document
processing 1645 document
processing 1646 document
processing 1647 document
processing 1648 document
processing 1649 document
processing 1650 document
processing 1651 document
processing 1652 document
processing 1653 document
processing 1654 document
processing 1655 document
processing 1656 document
processing 1657 document
processing 1658 document
processing 1659 document
processing 1660 document
processing 1661 document
processing 1662 document
processing 1663 document
processing 1664 document
processing 1665 document
processing 1666 document
processing 1667 document
processing 1668 document
processing 1669 document
processing 1670 document
processing 1671 document
processing 1672 document
processing 1673 document
processing 1674 document
processing 1675 document
processing 1676 document
processing 1677 document
processing 1678 document
processing 1679 document
processing 1680 document
processing 1681 document
processing 1682 document
processing 1683 document
processing 1684 document
processing 1685 document
processing 1686 document
processing 1687 document
processing 1688 document
processing 1689 document
processing 1690 document
processing 1691 document
processing 1692 document
processing 1693 document
processing 1694 document
processing 1695 document
processing 1696 document
processing 1697 document
processing 1698 document
processing 1699 document
processing 1700 document
processing 1701 document
processing 1702 document
processing 1703 document
processing 1704 document
processing 1705 document
processing 1706 document
processing 1707 document
processing 1708 document
processing 1709 document
processing 1710 document
processing 1711 document
processing 1712 document
processing 1713 document
processing 1714 document
processing 1715 document
processing 1716 document
processing 1717 document
processing 1718 document
processing 1719 document
processing 1720 document
processing 1721 document
processing 1722 document
processing 1723 document
processing 1724 document
processing 1725 document
processing 1726 document
processing 1727 document
processing 1728 document
processing 1729 document
processing 1730 document
processing 1731 document
processing 1732 document
processing 1733 document
processing 1734 document
processing 1735 document
processing 1736 document
processing 1737 document
processing 1738 document
processing 1739 document
processing 1740 document
processing 1741 document
processing 1742 document
processing 1743 document
processing 1744 document
processing 1745 document
processing 1746 document
processing 1747 document
processing 1748 document
processing 1749 document
processing 1750 document
processing 1751 document
processing 1752 document
processing 1753 document
processing 1754 document
processing 1755 document
processing 1756 document
processing 1757 document
processing 1758 document
processing 1759 document
processing 1760 document
processing 1761 document
processing 1762 document
processing 1763 document
processing 1764 document
processing 1765 document
processing 1766 document
processing 1767 document
processing 1768 document
processing 1769 document
processing 1770 document
processing 1771 document
processing 1772 document
processing 1773 document
processing 1774 document
processing 1775 document
processing 1776 document
processing 1777 document
processing 1778 document
processing 1779 document
processing 1780 document
processing 1781 document
processing 1782 document
processing 1783 document
processing 1784 document
processing 1785 document
processing 1786 document
processing 1787 document
processing 1788 document
processing 1789 document
processing 1790 document
processing 1791 document
processing 1792 document
processing 1793 document
processing 1794 document
processing 1795 document
processing 1796 document
processing 1797 document
processing 1798 document
processing 1799 document
processing 1800 document
processing 1801 document
processing 1802 document
processing 1803 document
processing 1804 document
processing 1805 document
processing 1806 document
processing 1807 document
processing 1808 document
processing 1809 document
processing 1810 document
processing 1811 document
processing 1812 document
processing 1813 document
processing 1814 document
processing 1815 document
processing 1816 document
processing 1817 document
processing 1818 document
processing 1819 document
processing 1820 document
processing 1821 document
processing 1822 document
processing 1823 document
processing 1824 document
processing 1825 document
processing 1826 document
processing 1827 document
processing 1828 document
processing 1829 document
processing 1830 document
processing 1831 document
processing 1832 document
processing 1833 document
processing 1834 document
processing 1835 document
processing 1836 document
processing 1837 document
processing 1838 document
processing 1839 document
processing 1840 document
processing 1841 document
processing 1842 document
processing 1843 document
processing 1844 document
processing 1845 document
processing 1846 document
processing 1847 document
processing 1848 document
processing 1849 document
processing 1850 document
processing 1851 document
processing 1852 document
processing 1853 document
processing 1854 document
processing 1855 document
processing 1856 document
processing 1857 document
processing 1858 document
processing 1859 document
processing 1860 document
processing 1861 document
processing 1862 document
processing 1863 document
processing 1864 document
processing 1865 document
processing 1866 document
processing 1867 document
processing 1868 document
processing 1869 document
processing 1870 document
processing 1871 document
processing 1872 document
processing 1873 document
processing 1874 document
processing 1875 document
processing 1876 document
processing 1877 document
processing 1878 document
processing 1879 document
processing 1880 document
processing 1881 document
processing 1882 document
processing 1883 document
processing 1884 document
processing 1885 document
processing 1886 document
processing 1887 document
processing 1888 document
processing 1889 document
processing 1890 document
processing 1891 document
processing 1892 document
processing 1893 document
processing 1894 document
processing 1895 document
processing 1896 document
processing 1897 document
processing 1898 document
processing 1899 document
processing 1900 document
processing 1901 document
processing 1902 document
processing 1903 document
processing 1904 document
processing 1905 document
processing 1906 document
processing 1907 document
processing 1908 document
processing 1909 document
processing 1910 document
processing 1911 document
processing 1912 document
processing 1913 document
processing 1914 document
processing 1915 document
processing 1916 document
processing 1917 document
processing 1918 document
processing 1919 document
processing 1920 document
processing 1921 document
processing 1922 document
processing 1923 document
processing 1924 document
processing 1925 document
processing 1926 document
processing 1927 document
processing 1928 document
processing 1929 document
processing 1930 document
processing 1931 document
processing 1932 document
processing 1933 document
processing 1934 document
processing 1935 document
processing 1936 document
processing 1937 document
processing 1938 document
processing 1939 document
processing 1940 document
processing 1941 document
processing 1942 document
processing 1943 document
processing 1944 document
processing 1945 document
processing 1946 document
processing 1947 document
processing 1948 document
processing 1949 document
processing 1950 document
processing 1951 document
processing 1952 document
processing 1953 document
processing 1954 document
processing 1955 document
processing 1956 document
processing 1957 document
processing 1958 document
processing 1959 document
processing 1960 document
processing 1961 document
processing 1962 document
processing 1963 document
processing 1964 document
processing 1965 document
processing 1966 document
processing 1967 document
processing 1968 document
processing 1969 document
processing 1970 document
processing 1971 document
processing 1972 document
processing 1973 document
processing 1974 document
processing 1975 document
processing 1976 document
processing 1977 document
processing 1978 document
processing 1979 document
processing 1980 document
processing 1981 document
processing 1982 document
processing 1983 document
processing 1984 document
processing 1985 document
processing 1986 document
processing 1987 document
processing 1988 document
processing 1989 document
processing 1990 document
processing 1991 document
processing 1992 document
processing 1993 document
processing 1994 document
processing 1995 document
processing 1996 document
processing 1997 document
processing 1998 document
processing 1999 document
processing 2000 document
processing 2001 document
processing 2002 document
processing 2003 document
processing 2004 document
processing 2005 document
processing 2006 document
processing 2007 document
processing 2008 document
processing 2009 document
processing 2010 document
processing 2011 document
processing 2012 document
processing 2013 document
processing 2014 document
processing 2015 document
processing 2016 document
processing 2017 document
processing 2018 document
processing 2019 document
processing 2020 document
processing 2021 document
processing 2022 document
processing 2023 document
processing 2024 document
processing 2025 document
processing 2026 document
processing 2027 document
processing 2028 document
processing 2029 document
processing 2030 document
processing 2031 document
processing 2032 document
processing 2033 document
processing 2034 document
processing 2035 document
processing 2036 document
processing 2037 document
processing 2038 document
processing 2039 document
processing 2040 document
processing 2041 document
processing 2042 document
processing 2043 document
processing 2044 document
processing 2045 document
processing 2046 document
processing 2047 document
processing 2048 document
processing 2049 document
processing 2050 document
processing 2051 document
processing 2052 document
processing 2053 document
processing 2054 document
processing 2055 document
processing 2056 document
processing 2057 document
processing 2058 document
processing 2059 document
processing 2060 document
processing 2061 document
processing 2062 document
processing 2063 document
processing 2064 document
processing 2065 document
processing 2066 document
processing 2067 document
processing 2068 document
processing 2069 document
processing 2070 document
processing 2071 document
processing 2072 document
processing 2073 document
processing 2074 document
processing 2075 document
processing 2076 document
processing 2077 document
processing 2078 document
processing 2079 document
processing 2080 document
processing 2081 document
processing 2082 document
processing 2083 document
processing 2084 document
processing 2085 document
processing 2086 document
processing 2087 document
processing 2088 document
processing 2089 document
processing 2090 document
processing 2091 document
processing 2092 document
processing 2093 document
processing 2094 document
processing 2095 document
processing 2096 document
processing 2097 document
processing 2098 document
processing 2099 document
processing 2100 document
processing 2101 document
processing 2102 document
processing 2103 document
processing 2104 document
processing 2105 document
processing 2106 document
processing 2107 document
processing 2108 document
processing 2109 document
processing 2110 document
processing 2111 document
processing 2112 document
processing 2113 document
processing 2114 document
processing 2115 document
processing 2116 document
processing 2117 document
processing 2118 document
processing 2119 document
processing 2120 document
processing 2121 document
processing 2122 document
processing 2123 document
processing 2124 document
processing 2125 document
processing 2126 document
processing 2127 document
processing 2128 document
processing 2129 document
processing 2130 document
processing 2131 document
processing 2132 document
processing 2133 document
processing 2134 document
processing 2135 document
processing 2136 document
processing 2137 document
processing 2138 document
processing 2139 document
processing 2140 document
processing 2141 document
processing 2142 document
processing 2143 document
processing 2144 document
processing 2145 document
processing 2146 document
processing 2147 document
processing 2148 document
processing 2149 document
processing 2150 document
processing 2151 document
processing 2152 document
processing 2153 document
processing 2154 document
processing 2155 document
processing 2156 document
processing 2157 document
processing 2158 document
processing 2159 document
processing 2160 document
processing 2161 document
processing 2162 document
processing 2163 document
processing 2164 document
processing 2165 document
processing 2166 document
processing 2167 document
processing 2168 document
processing 2169 document
processing 2170 document
processing 2171 document
processing 2172 document
processing 2173 document
processing 2174 document
processing 2175 document
processing 2176 document
processing 2177 document
processing 2178 document
processing 2179 document
processing 2180 document
processing 2181 document
processing 2182 document
processing 2183 document
processing 2184 document
processing 2185 document
processing 2186 document
processing 2187 document
processing 2188 document
processing 2189 document
processing 2190 document
processing 2191 document
processing 2192 document
processing 2193 document
processing 2194 document
processing 2195 document
processing 2196 document
processing 2197 document
processing 2198 document
processing 2199 document
processing 2200 document
processing 2201 document
processing 2202 document
processing 2203 document
processing 2204 document
processing 2205 document
processing 2206 document
processing 2207 document
processing 2208 document
processing 2209 document
processing 2210 document
processing 2211 document
processing 2212 document
processing 2213 document
processing 2214 document
processing 2215 document
processing 2216 document
processing 2217 document
processing 2218 document
processing 2219 document
processing 2220 document
processing 2221 document
processing 2222 document
processing 2223 document
processing 2224 document
processing 2225 document
processing 2226 document
processing 2227 document
processing 2228 document
processing 2229 document
processing 2230 document
processing 2231 document
processing 2232 document
processing 2233 document
processing 2234 document
processing 2235 document
processing 2236 document
processing 2237 document
processing 2238 document
processing 2239 document
processing 2240 document
processing 2241 document
processing 2242 document
processing 2243 document
processing 2244 document
processing 2245 document
processing 2246 document
processing 2247 document
processing 2248 document
processing 2249 document
processing 2250 document
processing 2251 document
processing 2252 document
processing 2253 document
processing 2254 document
processing 2255 document
processing 2256 document
processing 2257 document
processing 2258 document
processing 2259 document
processing 2260 document
processing 2261 document
processing 2262 document
processing 2263 document
processing 2264 document
processing 2265 document
processing 2266 document
processing 2267 document
processing 2268 document
processing 2269 document
processing 2270 document
processing 2271 document
processing 2272 document
processing 2273 document
processing 2274 document
processing 2275 document
processing 2276 document
processing 2277 document
processing 2278 document
processing 2279 document
processing 2280 document
processing 2281 document
processing 2282 document
processing 2283 document
processing 2284 document
processing 2285 document
processing 2286 document
processing 2287 document
processing 2288 document
processing 2289 document
processing 2290 document
processing 2291 document
processing 2292 document
processing 2293 document
processing 2294 document
processing 2295 document
processing 2296 document
processing 2297 document
processing 2298 document
processing 2299 document
processing 2300 document
processing 2301 document
processing 2302 document
processing 2303 document
processing 2304 document
processing 2305 document
processing 2306 document
processing 2307 document
processing 2308 document
processing 2309 document
processing 2310 document
processing 2311 document
processing 2312 document
processing 2313 document
processing 2314 document
processing 2315 document
processing 2316 document
processing 2317 document
processing 2318 document
processing 2319 document
processing 2320 document
processing 2321 document
processing 2322 document
processing 2323 document
processing 2324 document
processing 2325 document
processing 2326 document
processing 2327 document
processing 2328 document
processing 2329 document
processing 2330 document
processing 2331 document
processing 2332 document
processing 2333 document
processing 2334 document
processing 2335 document
processing 2336 document
processing 2337 document
processing 2338 document
processing 2339 document
processing 2340 document
processing 2341 document
processing 2342 document
processing 2343 document
processing 2344 document
processing 2345 document
processing 2346 document
processing 2347 document
processing 2348 document
processing 2349 document
processing 2350 document
processing 2351 document
processing 2352 document
processing 2353 document
processing 2354 document
processing 2355 document
processing 2356 document
processing 2357 document
processing 2358 document
processing 2359 document
processing 2360 document
processing 2361 document
processing 2362 document
processing 2363 document
processing 2364 document
processing 2365 document
processing 2366 document
processing 2367 document
processing 2368 document
processing 2369 document
processing 2370 document
processing 2371 document
processing 2372 document
processing 2373 document
processing 2374 document
processing 2375 document
processing 2376 document
processing 2377 document
processing 2378 document
processing 2379 document
processing 2380 document
processing 2381 document
processing 2382 document
processing 2383 document
processing 2384 document
processing 2385 document
processing 2386 document
processing 2387 document
processing 2388 document
processing 2389 document
processing 2390 document
processing 2391 document
processing 2392 document
processing 2393 document
processing 2394 document
processing 2395 document
processing 2396 document
processing 2397 document
processing 2398 document
processing 2399 document
processing 2400 document
processing 2401 document
processing 2402 document
processing 2403 document
processing 2404 document
processing 2405 document
processing 2406 document
processing 2407 document
processing 2408 document
processing 2409 document
processing 2410 document
processing 2411 document
processing 2412 document
processing 2413 document
processing 2414 document
processing 2415 document
processing 2416 document
processing 2417 document
processing 2418 document
processing 2419 document
processing 2420 document
processing 2421 document
processing 2422 document
processing 2423 document
processing 2424 document
processing 2425 document
processing 2426 document
processing 2427 document
processing 2428 document
processing 2429 document
processing 2430 document
processing 2431 document
processing 2432 document
processing 2433 document
processing 2434 document
processing 2435 document
processing 2436 document
processing 2437 document
processing 2438 document
processing 2439 document
processing 2440 document
processing 2441 document
processing 2442 document
processing 2443 document
processing 2444 document
processing 2445 document
processing 2446 document
processing 2447 document
processing 2448 document
processing 2449 document
processing 2450 document
processing 2451 document
processing 2452 document
processing 2453 document
processing 2454 document
processing 2455 document
processing 2456 document
processing 2457 document
processing 2458 document
processing 2459 document
processing 2460 document
processing 2461 document
processing 2462 document
processing 2463 document
processing 2464 document
processing 2465 document
processing 2466 document
processing 2467 document
processing 2468 document
processing 2469 document
processing 2470 document
processing 2471 document
processing 2472 document
processing 2473 document
processing 2474 document
processing 2475 document
processing 2476 document
processing 2477 document
processing 2478 document
processing 2479 document
processing 2480 document
processing 2481 document
processing 2482 document
processing 2483 document
processing 2484 document
processing 2485 document
processing 2486 document
processing 2487 document
processing 2488 document
processing 2489 document
processing 2490 document
processing 2491 document
processing 2492 document
processing 2493 document
processing 2494 document
processing 2495 document
processing 2496 document
processing 2497 document
processing 2498 document
processing 2499 document
processing 2500 document
processing 2501 document
processing 2502 document
processing 2503 document
processing 2504 document
processing 2505 document
processing 2506 document
processing 2507 document
processing 2508 document
processing 2509 document
processing 2510 document
processing 2511 document
processing 2512 document
processing 2513 document
processing 2514 document
processing 2515 document
processing 2516 document
processing 2517 document
processing 2518 document
processing 2519 document
processing 2520 document
processing 2521 document
processing 2522 document
processing 2523 document
processing 2524 document
processing 2525 document
processing 2526 document
processing 2527 document
processing 2528 document
processing 2529 document
processing 2530 document
processing 2531 document
processing 2532 document
processing 2533 document
processing 2534 document
processing 2535 document
processing 2536 document
processing 2537 document
processing 2538 document
processing 2539 document
processing 2540 document
processing 2541 document
processing 2542 document
processing 2543 document
processing 2544 document
processing 2545 document
processing 2546 document
processing 2547 document
processing 2548 document
processing 2549 document
processing 2550 document
processing 2551 document
processing 2552 document
processing 2553 document
processing 2554 document
processing 2555 document
processing 2556 document
processing 2557 document
processing 2558 document
processing 2559 document
processing 2560 document
processing 2561 document
processing 2562 document
processing 2563 document
processing 2564 document
processing 2565 document
processing 2566 document
processing 2567 document
processing 2568 document
processing 2569 document
processing 2570 document
processing 2571 document
processing 2572 document
processing 2573 document
processing 2574 document
processing 2575 document
processing 2576 document
processing 2577 document
processing 2578 document
processing 2579 document
processing 2580 document
processing 2581 document
processing 2582 document
processing 2583 document
processing 2584 document
processing 2585 document
processing 2586 document
processing 2587 document
processing 2588 document
processing 2589 document
processing 2590 document
processing 2591 document
processing 2592 document
processing 2593 document
processing 2594 document
processing 2595 document
processing 2596 document
processing 2597 document
processing 2598 document
processing 2599 document
processing 2600 document
processing 2601 document
processing 2602 document
processing 2603 document
processing 2604 document
processing 2605 document
processing 2606 document
processing 2607 document
processing 2608 document
processing 2609 document
processing 2610 document
processing 2611 document
processing 2612 document
processing 2613 document
processing 2614 document
processing 2615 document
processing 2616 document
processing 2617 document
processing 2618 document
processing 2619 document
processing 2620 document
processing 2621 document
processing 2622 document
processing 2623 document
processing 2624 document
processing 2625 document
processing 2626 document
processing 2627 document
processing 2628 document
processing 2629 document
processing 2630 document
processing 2631 document
processing 2632 document
processing 2633 document
processing 2634 document
processing 2635 document
processing 2636 document
processing 2637 document
processing 2638 document
processing 2639 document
processing 2640 document
processing 2641 document
processing 2642 document
processing 2643 document
processing 2644 document
processing 2645 document
processing 2646 document
processing 2647 document
processing 2648 document
processing 2649 document
processing 2650 document
processing 2651 document
processing 2652 document
processing 2653 document
processing 2654 document
processing 2655 document
processing 2656 document
processing 2657 document
processing 2658 document
processing 2659 document
processing 2660 document
processing 2661 document
processing 2662 document
processing 2663 document
processing 2664 document
processing 2665 document
processing 2666 document
processing 2667 document
processing 2668 document
processing 2669 document
processing 2670 document
processing 2671 document
processing 2672 document
processing 2673 document
processing 2674 document
processing 2675 document
processing 2676 document
processing 2677 document
processing 2678 document
processing 2679 document
processing 2680 document
processing 2681 document
processing 2682 document
processing 2683 document
processing 2684 document
processing 2685 document
processing 2686 document
processing 2687 document
processing 2688 document
processing 2689 document
processing 2690 document
processing 2691 document
processing 2692 document
processing 2693 document
processing 2694 document
processing 2695 document
processing 2696 document
processing 2697 document
processing 2698 document
processing 2699 document
processing 2700 document
processing 2701 document
processing 2702 document
processing 2703 document
processing 2704 document
processing 2705 document
processing 2706 document
processing 2707 document
processing 2708 document
processing 2709 document
processing 2710 document
processing 2711 document
processing 2712 document
processing 2713 document
processing 2714 document
processing 2715 document
processing 2716 document
processing 2717 document
processing 2718 document
processing 2719 document
processing 2720 document
processing 2721 document
processing 2722 document
processing 2723 document
processing 2724 document
processing 2725 document
processing 2726 document
processing 2727 document
processing 2728 document
processing 2729 document
processing 2730 document
processing 2731 document
processing 2732 document
processing 2733 document
processing 2734 document
processing 2735 document
processing 2736 document
processing 2737 document
processing 2738 document
processing 2739 document
processing 2740 document
processing 2741 document
processing 2742 document
processing 2743 document
processing 2744 document
processing 2745 document
processing 2746 document
processing 2747 document
processing 2748 document
processing 2749 document
processing 2750 document
processing 2751 document
processing 2752 document
processing 2753 document
processing 2754 document
processing 2755 document
processing 2756 document
processing 2757 document
processing 2758 document
processing 2759 document
processing 2760 document
processing 2761 document
processing 2762 document
processing 2763 document
processing 2764 document
processing 2765 document
processing 2766 document
processing 2767 document
processing 2768 document
processing 2769 document
processing 2770 document
processing 2771 document
processing 2772 document
processing 2773 document
processing 2774 document
processing 2775 document
processing 2776 document
processing 2777 document
processing 2778 document
processing 2779 document
processing 2780 document
processing 2781 document
processing 2782 document
processing 2783 document
processing 2784 document
processing 2785 document
processing 2786 document
processing 2787 document
processing 2788 document
processing 2789 document
processing 2790 document
processing 2791 document
processing 2792 document
processing 2793 document
processing 2794 document
processing 2795 document
processing 2796 document
processing 2797 document
processing 2798 document
processing 2799 document
processing 2800 document
processing 2801 document
processing 2802 document
processing 2803 document
processing 2804 document
processing 2805 document
processing 2806 document
processing 2807 document
processing 2808 document
processing 2809 document
processing 2810 document
processing 2811 document
processing 2812 document
processing 2813 document
processing 2814 document
processing 2815 document
processing 2816 document
processing 2817 document
processing 2818 document
processing 2819 document
processing 2820 document
processing 2821 document
processing 2822 document
processing 2823 document
processing 2824 document
processing 2825 document
processing 2826 document
processing 2827 document
processing 2828 document
processing 2829 document
processing 2830 document
processing 2831 document
processing 2832 document
processing 2833 document
processing 2834 document
processing 2835 document
processing 2836 document
processing 2837 document
processing 2838 document
processing 2839 document
processing 2840 document
processing 2841 document
processing 2842 document
processing 2843 document
processing 2844 document
processing 2845 document
processing 2846 document
processing 2847 document
processing 2848 document
processing 2849 document
processing 2850 document
processing 2851 document
processing 2852 document
processing 2853 document
processing 2854 document
processing 2855 document
processing 2856 document
processing 2857 document
processing 2858 document
processing 2859 document
processing 2860 document
processing 2861 document
processing 2862 document
processing 2863 document
processing 2864 document
processing 2865 document
processing 2866 document
processing 2867 document
processing 2868 document
processing 2869 document
processing 2870 document
processing 2871 document
processing 2872 document
processing 2873 document
processing 2874 document
processing 2875 document
processing 2876 document
processing 2877 document
processing 2878 document
processing 2879 document
processing 2880 document
processing 2881 document
processing 2882 document
processing 2883 document
processing 2884 document
processing 2885 document
processing 2886 document
processing 2887 document
processing 2888 document
processing 2889 document
processing 2890 document
processing 2891 document
processing 2892 document
processing 2893 document
processing 2894 document
processing 2895 document
processing 2896 document
processing 2897 document
processing 2898 document
processing 2899 document
processing 2900 document
processing 2901 document
processing 2902 document
processing 2903 document
processing 2904 document
processing 2905 document
processing 2906 document
processing 2907 document
processing 2908 document
processing 2909 document
processing 2910 document
processing 2911 document
processing 2912 document
processing 2913 document
processing 2914 document
processing 2915 document
processing 2916 document
processing 2917 document
processing 2918 document
processing 2919 document
processing 2920 document
processing 2921 document
processing 2922 document
processing 2923 document
processing 2924 document
processing 2925 document
processing 2926 document
processing 2927 document
processing 2928 document
processing 2929 document
processing 2930 document
processing 2931 document
processing 2932 document
processing 2933 document
processing 2934 document
processing 2935 document
processing 2936 document
processing 2937 document
processing 2938 document
processing 2939 document
processing 2940 document
processing 2941 document
processing 2942 document
processing 2943 document
processing 2944 document
processing 2945 document
processing 2946 document
processing 2947 document
processing 2948 document
processing 2949 document
processing 2950 document
processing 2951 document
processing 2952 document
processing 2953 document
processing 2954 document
processing 2955 document
processing 2956 document
processing 2957 document
processing 2958 document
processing 2959 document
processing 2960 document
processing 2961 document
processing 2962 document
processing 2963 document
processing 2964 document
processing 2965 document
processing 2966 document
processing 2967 document
processing 2968 document
processing 2969 document
processing 2970 document
processing 2971 document
processing 2972 document
processing 2973 document
processing 2974 document
processing 2975 document
processing 2976 document
processing 2977 document
processing 2978 document
processing 2979 document
processing 2980 document
processing 2981 document
processing 2982 document
processing 2983 document
processing 2984 document
processing 2985 document
processing 2986 document
processing 2987 document
processing 2988 document
processing 2989 document
processing 2990 document
processing 2991 document
processing 2992 document
processing 2993 document
processing 2994 document
processing 2995 document
processing 2996 document
processing 2997 document
processing 2998 document
processing 2999 document
In [10]:
#Creating Vocab frame list of words

all_vocab_frame = pd.DataFrame({'words': allvocab_tokenized}, index = allvocab_lemmatized)
print ('there are ' + str(all_vocab_frame.shape[0]) + ' items in all_vocab_frame')

vocab_frame = pd.DataFrame({'words': vocab_tokenized}, index = vocab_lemmatized)
print ('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
there are 49000 items in all_vocab_frame
there are 41879 items in vocab_frame
In [292]:
print (vocab_frame.head(20))
                         words
amazing                Amazing
cinematography  cinematography
know                      know
wonderful            wonderful
love                     loved
movie                    movie
pretty                  pretty
awesome                awesome
floor                  floored
story                    story
cgi                        CGI
bring                  Brought
kid                       kids
original              original
age                       ages
ago                        ago
time                      time
bring                  brought
grandchild       grandchildren
nostalgic            nostalgic
In [293]:
#To take unique vocab words
values, counts = np.unique(vocab_frame, return_counts=True)
all_values, all_counts = np.unique(all_vocab_frame, return_counts=True)
In [294]:
#To sort the vocab words
sorted_indices = np.argsort(-counts)
print(sorted_indices)
all_sorted_indices = np.argsort(-all_counts)
print(all_sorted_indices)
[3818 3995 3547 ... 2381 2357 5801]
[5747 6147 1651 ... 2672 2647 6311]
In [295]:
# To view the vocab words with counts
values = values[sorted_indices]
counts = counts[sorted_indices]

all_values = all_values[all_sorted_indices]
all_counts = all_counts[all_sorted_indices]

Frequently (Most commonly) used words in Audience Reviews

In [305]:
#Plot to print Frequencey words

font = {'weight' : 'bold',
        'size'   : 50}

plt.rc('font', **font)
fig = plt.figure(figsize=(70,70))
plt.barh(values[:75], counts[:75])
plt.gca().invert_yaxis()
plt.show()
Out[305]:
<BarContainer object of 75 artists>
In [266]:
#Import Stop words library
from spacy.lang.en.stop_words import STOP_WORDS
In [285]:
# Add word Movie to stop words - default list
stopwords1 = ['movie'] + list(STOP_WORDS)
print(stopwords1[:10],"\n\n")
['movie', "'re", 'last', 'are', 'full', 'upon', 'twelve', 'third', 'thence', '’m'] 


Plotting frequency plot after adding 'Movie' word to stop words.

In [296]:
#Plot to print Frequencey words

font = {'weight' : 'bold',
        'size'   : 50}

plt.rc('font', **font)
fig = plt.figure(figsize=(70,70))
plt.barh(values[:50], counts[:50])
plt.gca().invert_yaxis()
plt.line(x=200)
plt.show()
Out[296]:
<BarContainer object of 50 artists>

Insight from Frequncy word count(Most commonly used words)

  1. 40% of Audience compared movie with Original movie. ie. Part 1
  2. TOP words gives positive sentiment. Top words are Like, Good, Great, better.
  3. Animation, CGI, Voice, Cartoon, Song, Music were commented on technical aspect.
  4. Emotions, feel, love, life, realistic, Pride were commented on emotional aspect.
  5. Little King, Lion King, Simba were commented on characters.

Word Cloud Plot

In [297]:
from wordcloud import WordCloud
In [303]:
#Plot to get wordcloud
wordcloud = WordCloud().generate(data2['Review'][500])
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
Out[303]:
<matplotlib.image.AxesImage at 0x1f8d7cf8>
Out[303]:
(-0.5, 399.5, 199.5, -0.5)

Insight from wordcloud

Animation and amazing are visible from word cloud.

Business suggestion on Merchandising words

In [307]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\keywords.png")
Out[307]:
In [308]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\simbatshirt.png")
Out[308]:
In [62]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\hakunamatata.png")
Out[62]:

6. Vectorizing words into Matrix form

We have two ways to vectorize the words into matrix form

  1. TFIDF vectorizer
  2. Count vectorizer.

Exploring TFIDF vectorizer further with 2 more options with text data,

  1. Non stop word removal text
  2. Lemmatized cleantext

6A. TFIDF vectorizer

In [11]:
## tfidf vectorizer needs sentence and not token. Hence we need to combine all the tokens back to form a string

data2['clean_text_stemmed'] = [' '.join(text) for text in data2['clean_text_stemmed']]
data2['clean_text_lemmatized'] = [' '.join(text) for text in data2['clean_text_lemmatized']]
data2['text_lemmatized'] = [' '.join(text) for text in data2['text_lemmatized']]
In [12]:
#To create new variables for Lemmatized clean text and Lemmatized non stop word removal text
cleantext_lemma = data2['clean_text_lemmatized']
text_lemma_nsw = data2['text_lemmatized']

For cleantext Lemmatized

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters for cleantext_lemma
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, max_features=7000,
                                 min_df=0.001,
                                 use_idf=True, ngram_range=(1,3))

tfidf_matrix_clntxt = tfidf_vectorizer.fit_transform(cleantext_lemma)

print(tfidf_matrix_clntxt.shape)
(3000, 3525)
In [14]:
#Converting sparse data to dense form
TV_Mat_clntxt = tfidf_matrix_clntxt.todense()
TV_Mat_clntxt
Out[14]:
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
In [15]:
#For TFid vector - converting to dataframe
TV_Mat_clntxt = pd.DataFrame(TV_Mat_clntxt)
TV_Mat_clntxt.head()
Out[15]:
0 1 2 3 4 5 6 7 8 9 ... 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 3525 columns

In [16]:
#Define the Target Variable
TV_Mat_clntxt['Sentiment'] = ['0']*1500+['1']*1500

For Non stop word removal text

In [17]:
#define vectorizer parameters for text_lemma_nsw

#define vectorizer parameters for cleantext_lemma
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, max_features=7000,
                                 min_df=0.001,
                                 use_idf=True, ngram_range=(1,3))

tfidf_matrix_nsw = tfidf_vectorizer.fit_transform(text_lemma_nsw)

print(tfidf_matrix_nsw.shape)
(3000, 4248)
In [18]:
#Converting sparse data to dense form
TV_Mat_nsw = tfidf_matrix_nsw.todense()
TV_Mat_nsw
Out[18]:
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
In [19]:
#For TFid vector - with non stop word removal text
TV_Mat_nsw = pd.DataFrame(TV_Mat_nsw)
TV_Mat_nsw.head()
Out[19]:
0 1 2 3 4 5 6 7 8 9 ... 4238 4239 4240 4241 4242 4243 4244 4245 4246 4247
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 4248 columns

In [20]:
#Define the Target Variable
TV_Mat_nsw['Sentiment'] = ['0']*1500+['1']*1500

6B. Counter Vectorizer

Loading libraries for countVectorizer

In [21]:
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
In [22]:
#define count vectorizer parameters for cleantext_lemma
cv=CountVectorizer(stop_words='english',lowercase=True,
                   strip_accents='unicode',decode_error='ignore')

tdm = cv.fit_transform(data2['clean_text_lemmatized'])
tdm
Out[22]:
<3000x4414 sparse matrix of type '<class 'numpy.int64'>'
	with 35430 stored elements in Compressed Sparse Row format>
In [23]:
#Converting sparse data to dense form
Matrix = tdm.todense()
Matrix
Out[23]:
matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
In [24]:
#For Counter vectorizer
Mat = pd.DataFrame(Matrix)
Mat.head()
Out[24]:
0 1 2 3 4 5 6 7 8 9 ... 4404 4405 4406 4407 4408 4409 4410 4411 4412 4413
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 4414 columns

In [25]:
#Define the Target Variable
Mat['Sentiment'] = ['0']*1500+['1']*1500
Mat.head()
Out[25]:
0 1 2 3 4 5 6 7 8 9 ... 4405 4406 4407 4408 4409 4410 4411 4412 4413 Sentiment
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 4415 columns

7. Building KMeans Algorithm - Clustering

In [30]:
#Loading required Libraries for Clustering
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import joblib
Sum_of_squared_distances = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=143)
    kmeanModel.fit(tfidf_matrix_nsw)
    Sum_of_squared_distances.append(kmeanModel.inertia_)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=1, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=6, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=7, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
Out[30]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=9, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=143, tol=0.0001, verbose=0)
In [31]:
## Plot the elbow

font = {'weight' : 'bold',
        'size'   : 10}

plt.rc('font', **font)
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Out[31]:
[<matplotlib.lines.Line2D at 0x22d76390>]
Out[31]:
Text(0.5, 0, 'k')
Out[31]:
Text(0, 0.5, 'Sum_of_squared_distances')
Out[31]:
Text(0.5, 1.0, 'Elbow Method For Optimal k')

We shall ideally build clustering with K=4.

with interest to get insight on binary class clustering, we will build cluster with K=2

7A. Building Binary class Cluster

In [32]:
#Building Binary class cluster to represent Positive and Negative Sentiment.
num_clusters = 2

km = KMeans(n_clusters=num_clusters)

km.fit(tfidf_matrix_nsw)
#km.labels_
clusters = km.labels_.tolist()
#km.cluster_centers
centers = km.cluster_centers_
print(f"the cluster centers are {centers}")

joblib.dump(km,  'doc_cluster_best_K.pkl')
Out[32]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
the cluster centers are [[1.22424809e-03 2.85413750e-03 1.19173202e-04 ... 2.67839644e-04
  6.65947080e-05 1.83536280e-04]
 [1.13665009e-03 5.57407398e-03 6.89933721e-04 ... 3.10080292e-03
  1.54331796e-03 1.36289481e-04]]
Out[32]:
['doc_cluster_best_K.pkl']
In [33]:
#To view cluster center/Mean values
print(km.cluster_centers_)
print(km.cluster_centers_.shape)
[[1.22424809e-03 2.85413750e-03 1.19173202e-04 ... 2.67839644e-04
  6.65947080e-05 1.83536280e-04]
 [1.13665009e-03 5.57407398e-03 6.89933721e-04 ... 3.10080292e-03
  1.54331796e-03 1.36289481e-04]]
(2, 4248)
In [34]:
#To sort cluster center in order. 
km.cluster_centers_.argsort()
Out[34]:
array([[2671, 1812, 3138, ..., 2823, 2128, 2332],
       [3713, 2551, 2552, ...,  903,  895, 2617]], dtype=int64)
In [35]:
## Reversing the list so that index of max element is in 0th index
km.cluster_centers_.argsort()[:,::-1]
Out[35]:
array([[2332, 2128, 2823, ..., 3138, 1812, 2671],
       [2617,  895,  903, ..., 2552, 2551, 3713]], dtype=int64)
In [36]:
#To create cluster group feature in our data 
data2['cluster_group'] = clusters
data2.pop('clean_text', None)
pd.DataFrame(data2).head(5)
Out[36]:
index ReviewID Review Sentiment clean_text_stemmed clean_text_lemmatized text_stemmed text_lemmatized cluster_group
0 0 9bd27314-fc78-41fe-ba69-42669bc763d4 amazing cinematography dont know wonderful Pos amaz cinematographi nt know wonder amazing cinematography not know wonderful [amaz, cinematographi, do, nt, know, wonder] amazing cinematography do not know wonderful 1
1 1 966121979 loved movie Pos love movi love movie [love, movi] love movie 0
2 2 75D441F3-4AE6-4447-9702-8EDD3BA4153A pretty awesomei floored story cgi Pos pretti awesomei floor stori cgi pretty awesomei floored story cgi [pretti, awesomei, floor, stori, cgi] pretty awesomei floored story cgi 0
3 3 c7d41004-8ce5-46f7-ab89-b94d0c634bbe brought kids original ages ago time brought 3 ... Pos brought kid origin age ago time brought grandc... bring kid original age ago time bring grandchi... [brought, kid, origin, age, ago, time, brought... bring kid original age ago time bring grandchi... 0
4 4 2c88353b-108c-436d-bb5f-6ab4f9bb8641 grew watching kids version really brings life ... Pos grew watch kid version bring life awesom grow watch kid version bring life awesome [grew, watch, kid, version, realli, bring, lif... grow watch kid version really bring life awesome 0
In [37]:
#To create new Dataframe to proceed with few more steps
cluster_df = pd.DataFrame(data2)
In [38]:
#To view count of levels in each cluster
cluster_df['cluster_group'].value_counts()
Out[38]:
0    2384
1     616
Name: cluster_group, dtype: int64
In [39]:
##Step 1
cluster_df['tokenized_text'] = [text.split(' ') for text in cluster_df['text_lemmatized']]
In [40]:
##Step 2
grouped_text = cluster_df.groupby('cluster_group')['tokenized_text']
In [41]:
## Fetch entire tokenized text for specific group
grouped_text.get_group(0)
Out[41]:
1                                           [love, movie]
2                 [pretty, awesomei, floored, story, cgi]
3       [bring, kid, original, age, ago, time, bring, ...
4       [grow, watch, kid, version, really, bring, lif...
5                   [realistic, look, scary, little, one]
                              ...                        
2993    [try, make, film, realistic, fashion, lose, co...
2994    [absolutely, terrible, like, talk, 15x, speed,...
2995    [less, dynamic, 90, animation, broadway, adapt...
2996    [timon, pumba, good, part, voice, beyonc, awkw...
2999    [great, effect, slow, talk, much, like, watch,...
Name: tokenized_text, Length: 2384, dtype: object

Frequent word data in cluster wise

In [42]:
from itertools import chain
In [43]:
frequent_words_df = pd.DataFrame(columns={"values", "counts", "cluster_id"})
In [44]:
for num in range(num_clusters):
    values, counts = np.unique(list(chain.from_iterable(grouped_text.get_group(num))), return_counts=True)
    sorted_indices = np.argsort(-counts)
    frequent_words_df = frequent_words_df.append({"values":values[sorted_indices], "counts":counts[sorted_indices], "cluster_id": num}, ignore_index=True)
In [45]:
#To view head of frequenct value dataframe
frequent_words_df.head()
Out[45]:
cluster_id values counts
0 0 [movie, original, love, like, good, great, see... [1141, 819, 503, 432, 425, 399, 290, 270, 246,...
1 1 [not, do, movie, original, like, be, good, voi... [915, 611, 481, 428, 343, 268, 228, 205, 186, ...

Frequncy PLOTS to understand Postive reviews and Negative reviews

In [53]:
#Plot to get frequency words in different levels. 

font = {'weight' : 'bold',
        'size'   : 70}

plt.rc('font', **font)

fig = plt.figure(figsize=(100,100))
plt.subplot(2,2,1)
plt.barh(frequent_words_df.loc[0,'values'][:30], frequent_words_df.loc[0,'counts'][:30])
plt.gca().invert_yaxis()


plt.subplot(2,2,2)
plt.barh(frequent_words_df.loc[1,'values'][:30], frequent_words_df.loc[1,'counts'][:30])
plt.gca().invert_yaxis()
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x393485c0>
Out[53]:
<BarContainer object of 30 artists>
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x3a586588>
Out[53]:
<BarContainer object of 30 artists>
In [68]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\wordcamparision.png")
Out[68]:

Clustering Comparing Report

In [71]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\ClusteringComparisionReport.png")
Out[71]:

Business Insight on clustering report

1. Original:

40% of the Audience compared Lion King 2019 movie with Original movie 1994.

62% of Audience gave positive sentiment on Movie while 38% of the audience gave negative sentiment

2. Story:

12% of the audience talked about story of the movie.

69% of audience are happy with story line while 31% of audience are not so happy.

3. Remake:

13% of audience commented about remake version.

61% of audience felt excited about remake version while 39% of audience expectation was not met.

4. Voice Over :

11% of audience expressed about Voice over in the movie.

53% of audience felt pleasant. 47% of audience expressed it as disguesting.

Especially female voice over character should have been better.

5. Disney - Comment on production house

10% of audience wrote comment on Disney.

60% approve of the Disney's make while 40% disapprove of it.

6. Animation

90% of audience who wrote comment expressed WOW feel for animation.

However 10% felt it could have been better.

Overall, Thumbsup for Animation !!

7. Animated Charcters - CGI - Computer Generated Imaginery

8% of audience expressed about CGI - Animated characters.

70% of audience are exclaimed about how it was made.

30% of audience expected level did not meet.

8. Emotional IQ

10% of the audience explained about emotional mix in the movie.

70% of audience felt good as movie portary Pride, Bravery and life cycle etc.

30% of audience felt scary for little kids, rude in few of the scenes.

9. Not - Strong Negation

28% of audience expressed strong negative sentiments on movie.

10. Scene

5% of audience felt few of the scenes are scary, brave, emotional.

8. Train Test split to build models

8A. Train test split for TFIDF vectors - CleanText Lemma

In [26]:
# Train Test split_TV - Cleantext Lemma
TV_Mat_clntxt = TV_Mat_clntxt.sample(frac = 1,random_state=1234)
train_tv_clntext = TV_Mat_clntxt.iloc[:2400]
test_tv_clntext = TV_Mat_clntxt.iloc[2400:]
In [27]:
#To get train test split for both X and Y

X_train_TV_clntxt=train_tv_clntext.iloc[:,:-1]#from col1 to collast, except last one (Excl target) slicing
Y_train_TV_clntxt=train_tv_clntext.iloc[:,-1]#selecting target col

X_test_TV_clntxt=test_tv_clntext.iloc[:,:-1]#from col1 to collast, except last one (Excl target) slicing
Y_test_TV_clntxt=test_tv_clntext.iloc[:,-1]#selecting target col

8B. Train test split for TFIDF vectors - Non stop word removed Text

In [28]:
# Train Test split_TV - Non stop word removal word
TV_Mat_nsw = TV_Mat_nsw.sample(frac = 1,random_state=4321)
train_tv_nsw = TV_Mat_nsw.iloc[:2400]
test_tv_nsw = TV_Mat_nsw.iloc[2400:]
In [29]:
#To get train test split for both X and Y

X_train_TV_nsw=train_tv_nsw.iloc[:,:-1]#from col1 to collast, except last one (Excl target) slicing
Y_train_TV_nsw=train_tv_nsw.iloc[:,-1]#selecting target col

X_test_TV_nsw=test_tv_nsw.iloc[:,:-1]#from col1 to collast, except last one (Excl target) slicing
Y_test_TV_nsw=test_tv_nsw.iloc[:,-1]#selecting target col

8c. Train Test split to build models for count vectorizer

In [30]:
# Train Test split_count vectorizer
Mat = Mat.sample(frac = 1,random_state=1234)
train_cv = Mat.iloc[:2400]
test_cv = Mat.iloc[2400:]
In [31]:
#To get train test split for both X and Y
X_train_CV=train_cv.iloc[:,:-1]#from col1 to collast, except last one (Excl target) slicing
Y_train_CV=train_cv.iloc[:,-1]#selecting target col

X_test_CV=test_cv.iloc[:,:-1]#from col1 to collast, except last one (Excl target) slicing
Y_test_CV=test_cv.iloc[:,-1]#selecting target col

9. Model Building using Count Vectorized word

9A. Logistic Regression Model

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score

logreg = LogisticRegression()
logreg.fit(X_train_CV,Y_train_CV)
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning:

Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

Out[32]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
In [33]:
#Predictions on test data
lr_pred_train=logreg.predict(X_train_CV)
lr_pred=logreg.predict(X_test_CV)
In [34]:
# Test data confusion Matrix
confusion_matrix_lr = confusion_matrix(Y_test_CV,lr_pred)
In [35]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("\nTrain DATA ACCURACY",accuracy_score(Y_train_CV,lr_pred_train))
print("\n Traindata f1-score for class '1'",f1_score(Y_train_CV,lr_pred_train, average='weighted', pos_label='1' ))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_lr)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_CV,lr_pred))
print("\nTest data f1-score for class '1'",f1_score(Y_test_CV,lr_pred, average='weighted', pos_label='1' ))

--------------------------------------



Train DATA ACCURACY 0.9233333333333333
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\classification.py:1259: UserWarning:

Note that pos_label (set to '1') is ignored when average != 'binary' (got 'weighted'). You may use labels=[pos_label] to specify a single positive class.

 Traindata f1-score for class '1' 0.9233247063922715


--------------------------------------


TEST Conf Matrix : 
 [[236  55]
 [ 67 242]]

TEST DATA ACCURACY 0.7966666666666666
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\classification.py:1259: UserWarning:

Note that pos_label (set to '1') is ignored when average != 'binary' (got 'weighted'). You may use labels=[pos_label] to specify a single positive class.

Test data f1-score for class '1' 0.7967073374004068

Logistic classification model give 79.6% F1 score

9B. Naive Bayes Algorithm

In [36]:
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB()
NB.fit(X_train_CV,Y_train_CV)

# Predictions on test data
NB_pred=NB.predict(X_train_CV)

# Predictions on test data
NB_pred=NB.predict(X_test_CV)
Out[36]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [39]:
# Test data confusion Matrix
confusion_matrix_NB = confusion_matrix(Y_test_CV,NB_pred)

### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("\nTrain DATA ACCURACY",accuracy_score(Y_train_CV,lr_pred_train))
print("\n Traindata f1-score for class '1'",f1_score(Y_train_CV,lr_pred_train, average='weighted', pos_label='1'))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_NB)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_CV,NB_pred))
print("\nTest data f1-score for class '1'",f1_score(Y_test_CV,NB_pred, average='weighted' ))

--------------------------------------



Train DATA ACCURACY 0.9233333333333333
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\classification.py:1259: UserWarning:

Note that pos_label (set to '1') is ignored when average != 'binary' (got 'weighted'). You may use labels=[pos_label] to specify a single positive class.

 Traindata f1-score for class '1' 0.9233247063922715


--------------------------------------


TEST Conf Matrix : 
 [[231  60]
 [ 70 239]]

TEST DATA ACCURACY 0.7833333333333333

Test data f1-score for class '1' 0.7833814900426743

Naive Bayes model give 78.3% F1 score

9C. SVM Model

In [40]:
## Build a SVM Classifier
from sklearn.svm import SVC

## Create an SVC object and print it to see the default arguments
svc = SVC()
svc
Out[40]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [41]:
## Fit
svc_cv = svc.fit(X_train_CV, Y_train_CV)
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\svm\base.py:193: FutureWarning:

The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.

In [44]:
# Predictions on train data
svm_pred_cv_train=svc_cv.predict(X_train_CV)

# Predictions on test data
svm_pred_cv=svc_cv.predict(X_test_CV)
confusion_matrix_test_svm_cv= confusion_matrix(Y_test_CV, svm_pred_cv)
In [46]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("\Train DATA ACCURACY",accuracy_score(Y_train_CV, svm_pred_cv_train))
print("\n Train data f1-score for class '1'",f1_score(Y_train_CV, svm_pred_cv_train, average='weighted'))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_test_svm_cv)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_CV, svm_pred_cv))
print("\nTest data f1-score for class '1'",f1_score(Y_test_CV, svm_pred_cv, average='weighted'))

--------------------------------------


\Train DATA ACCURACY 0.5266666666666666

 Train data f1-score for class '1' 0.3909319343556632


--------------------------------------


TEST Conf Matrix : 
 [[286   5]
 [292  17]]

TEST DATA ACCURACY 0.505

Test data f1-score for class '1' 0.3721408084439176

SVM model gives worst accuracy.

9D. Decision Tree Model

In [47]:
from sklearn.tree import DecisionTreeClassifier
Dtc = DecisionTreeClassifier()
In [48]:
# Build Model
%time Dtc.fit(X_train_CV, Y_train_CV)
Wall time: 2.04 s
Out[48]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [50]:
# Predictions on test data
DTC_pred_CV_train=Dtc.predict(X_train_CV)
DTC_pred_CV=Dtc.predict(X_test_CV)
confusion_matrix_test_dtc_cv= confusion_matrix(Y_test_CV,DTC_pred_CV)
In [51]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("\Train DATA ACCURACY",accuracy_score(Y_train_CV, DTC_pred_CV_train))
print("\n Train data f1-score for class '1'",f1_score(Y_train_CV, DTC_pred_CV_train, average='weighted'))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_test_dtc_cv)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_CV,DTC_pred_CV))
print("\nTest data f1-score for class '1'",f1_score(Y_test_CV,DTC_pred_CV, average='weighted'))

--------------------------------------


\Train DATA ACCURACY 0.9966666666666667

 Train data f1-score for class '1' 0.9966666157364612


--------------------------------------


TEST Conf Matrix : 
 [[227  64]
 [133 176]]

TEST DATA ACCURACY 0.6716666666666666

Test data f1-score for class '1' 0.6684338512418894

9E. Random Forest Model

In [52]:
from sklearn.ensemble import RandomForestClassifier
Rf = RandomForestClassifier()
In [54]:
# Build Model
%time Rf.fit(X_train_CV, Y_train_CV)
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\forest.py:245: FutureWarning:

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

Wall time: 830 ms
Out[54]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [55]:
# Predictions on test data
RF_pred_CV_train=Dtc.predict(X_train_CV)
RF_pred_CV=Dtc.predict(X_test_CV)
confusion_matrix_test_rf_cv= confusion_matrix(Y_test_CV,DTC_pred_CV)
In [56]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("\Train DATA ACCURACY",accuracy_score(Y_train_CV, RF_pred_CV_train))
print("\n Train data f1-score for class '1'",f1_score(Y_train_CV, RF_pred_CV_train, average='weighted'))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_test_rf_cv)
print("\Train DATA ACCURACY",accuracy_score(Y_test_CV, RF_pred_CV))
print("\n Train data f1-score for class '1'",f1_score(Y_test_CV, RF_pred_CV, average='weighted'))

--------------------------------------


\Train DATA ACCURACY 0.9966666666666667

 Train data f1-score for class '1' 0.9966666157364612


--------------------------------------


TEST Conf Matrix : 
 [[227  64]
 [133 176]]
\Train DATA ACCURACY 0.6716666666666666

 Train data f1-score for class '1' 0.6684338512418894

9F. Gradient Boosting Ensemble Model

In [57]:
from sklearn.ensemble import GradientBoostingClassifier
In [58]:
Gbm = GradientBoostingClassifier()
In [59]:
# Build Model
%time Gbm.fit(X_train_CV, Y_train_CV)
Wall time: 1min 8s
Out[59]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
In [61]:
# Predictions on test data
Gbm_pred_CV_train=Gbm.predict(X_train_CV)
Gbm_pred_CV=Gbm.predict(X_test_CV)
confusion_matrix_test_gbm_cv= confusion_matrix(Y_test_CV,Gbm_pred_CV)
In [62]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("\Train DATA ACCURACY",accuracy_score(Y_train_CV, Gbm_pred_CV_train))
print("\n Train data f1-score for class '1'",f1_score(Y_train_CV, Gbm_pred_CV_train, average='weighted'))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_test_gbm_cv)
print("\Train DATA ACCURACY",accuracy_score(Y_test_CV, Gbm_pred_CV))
print("\n Train data f1-score for class '1'",f1_score(Y_test_CV, Gbm_pred_CV, average='weighted'))

--------------------------------------


\Train DATA ACCURACY 0.8320833333333333

 Train data f1-score for class '1' 0.8317696891772752


--------------------------------------


TEST Conf Matrix : 
 [[225  66]
 [ 75 234]]
\Train DATA ACCURACY 0.765

 Train data f1-score for class '1' 0.7650528868995525

10. Model Building using TFIDF vectorizer

10A. Logistic Regression Model

In [20]:
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.metrics import accuracy_score,f1_score, confusion_matrix
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from matplotlib import pyplot
In [54]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.4)
logreg.fit(X_train_TV_nsw,Y_train_TV_nsw)
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning:

Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

Out[54]:
LogisticRegression(C=0.4, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
In [55]:
#Predictions on train data
lr_pred_tv_train_nsw=logreg.predict(X_train_TV_nsw)
#Predictions on test data
lr_pred_tv_test_nsw=logreg.predict(X_test_TV_nsw)
In [56]:
# Train data confusion Matrix
confusion_matrix_lr_tv_train_nsw = confusion_matrix(Y_train_TV_nsw,lr_pred_tv_train_nsw)
# Test data confusion Matrix
confusion_matrix_lr_tv_test_nsw = confusion_matrix(Y_test_TV_nsw,lr_pred_tv_test_nsw)
In [57]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("Train Conf Matrix : \n", confusion_matrix_lr_tv_train_nsw)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_nsw,lr_pred_tv_train_nsw))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_nsw,lr_pred_tv_train_nsw, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_lr_tv_test_nsw)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_nsw,lr_pred_tv_test_nsw))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_nsw,lr_pred_tv_test_nsw, average='weighted'))

--------------------------------------


Train Conf Matrix : 
 [[1037  170]
 [  98 1095]]

Train DATA ACCURACY 0.8883333333333333

Train data f1-score for class '1' 0.8882523276904544


--------------------------------------


TEST Conf Matrix : 
 [[209  84]
 [ 37 270]]

TEST DATA ACCURACY 0.7983333333333333

Test data f1-score for class '1' 0.7967105087118055
In [58]:
# predict probabilities
probs = logreg.predict_proba(X_test_TV_nsw)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# predict class values
yhat = logreg.predict(X_test_TV_nsw)
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(Y_test_TV_nsw, probs, pos_label='1')
# calculate F1 score
f1 = f1_score(Y_test_TV_nsw, yhat, average='weighted')
In [59]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# calculate roc curve
fpr, tpr, thresholds = roc_curve(Y_test_TV_nsw, probs, pos_label='1')
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the precision-recall curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
Out[59]:
[<matplotlib.lines.Line2D at 0x236a4e48>]
Out[59]:
[<matplotlib.lines.Line2D at 0x22dbe390>]

Logistic on clean Text

In [21]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.5)
logreg.fit(X_train_TV_clntxt,Y_train_TV_clntxt)
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning:

Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

Out[21]:
LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
In [22]:
#Predictions on train data
lr_pred_tv_train_clntxt =logreg.predict(X_train_TV_clntxt)
#Predictions on test data
lr_pred_tv_test_clntxt=logreg.predict(X_test_TV_clntxt)
In [23]:
# Train data confusion Matrix
confusion_matrix_lr_tv_train_clntxt = confusion_matrix(Y_train_TV_clntxt,lr_pred_tv_train_clntxt)
# Test data confusion Matrix
confusion_matrix_lr_tv_test_clntxt = confusion_matrix(Y_test_TV_clntxt,lr_pred_tv_test_clntxt)
In [24]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("Train Conf Matrix : \n", confusion_matrix_lr_tv_train_clntxt)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,lr_pred_tv_train_clntxt))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_clntxt,lr_pred_tv_train_clntxt, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_lr_tv_test_clntxt)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_clntxt,lr_pred_tv_test_clntxt))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_clntxt,lr_pred_tv_test_clntxt, average='weighted'))

--------------------------------------


Train Conf Matrix : 
 [[1036  173]
 [ 107 1084]]

Train DATA ACCURACY 0.8833333333333333

Train data f1-score for class '1' 0.8832691409897291


--------------------------------------


TEST Conf Matrix : 
 [[223  68]
 [ 39 270]]

TEST DATA ACCURACY 0.8216666666666667

Test data f1-score for class '1' 0.8209873082330188
In [129]:
lr_pred_unseen = logreg.predict(TV_Mat_unseen)

10B. Naive Bayes Model

In [25]:
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB(alpha=0.4, fit_prior=True)
NB.fit(X_train_TV_clntxt,Y_train_TV_clntxt)

# Predictions on train data
NB_pred_train=NB.predict(X_train_TV_clntxt)
confusion_matrix_NB_train= confusion_matrix(Y_train_TV_clntxt,NB_pred_train)

# Predictions on test data
NB_pred_test=NB.predict(X_test_TV_clntxt)
confusion_matrix_NB_test= confusion_matrix(Y_test_TV_clntxt,NB_pred_test)
Out[25]:
MultinomialNB(alpha=0.4, class_prior=None, fit_prior=True)
In [26]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("Train Conf Matrix : \n", confusion_matrix_NB_train)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,NB_pred_train))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_clntxt,NB_pred_train, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_NB_test)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_clntxt,NB_pred_test))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_clntxt,NB_pred_test, average='weighted'))

--------------------------------------


Train Conf Matrix : 
 [[1099  110]
 [  95 1096]]

Train DATA ACCURACY 0.9145833333333333

Train data f1-score for class '1' 0.9145840006520845


--------------------------------------


TEST Conf Matrix : 
 [[231  60]
 [ 54 255]]

TEST DATA ACCURACY 0.81

Test data f1-score for class '1' 0.8099238782051281
In [130]:
NB_pred_unseen=NB.predict(TV_Mat_unseen)

10C. Decision Tree model

In [27]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
In [28]:
Dtc = DecisionTreeClassifier()

def dtc_params_best(X, y, nfolds):
    criterion = ['entropy', 'gini']
    max_depth = [6,8,10,12]
    min_samples_split = [2,4,6,8,10,20]
    min_samples_leaf = [2, 4, 6]
    param_grid = {'criterion': ['entropy', 'gini'], 'max_depth': [6,8,10,12], 'min_samples_split': [2, 10, 20], 'min_samples_leaf': [2, 4, 6]}
    grid_search = GridSearchCV(Dtc, param_grid,cv=nfolds)
    grid_search.fit(X, y)
    grid_search.best_params_
    
    return grid_search.best_params_
In [86]:
val=dtc_params_best(X_train_TV_clntxt, Y_train_TV_clntxt,5)
In [87]:
val
Out[87]:
{'criterion': 'gini',
 'max_depth': 12,
 'min_samples_leaf': 2,
 'min_samples_split': 2}
In [29]:
Dtc_val = DecisionTreeClassifier(criterion ='gini', max_depth= 12, min_samples_leaf =2, min_samples_split= 2)
Dtc_val.fit(X_train_TV_clntxt,Y_train_TV_clntxt)
Out[29]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=12,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
In [30]:
dtc_pred_train = Dtc_val.predict(X_train_TV_clntxt)
dtc_pred_test = Dtc_val.predict(X_test_TV_clntxt)

print(Dtc_val.score(X_train_TV_clntxt, Y_train_TV_clntxt))
print(Dtc_val.score(X_test_TV_clntxt, Y_test_TV_clntxt))

confusion_matrix_DT_train =confusion_matrix(Y_train_TV_clntxt, dtc_pred_train)

confusion_matrix_DT_test= confusion_matrix(Y_test_TV_clntxt, dtc_pred_test)
0.7941666666666667
0.705
In [31]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("Train Conf Matrix : \n", confusion_matrix_DT_train)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,dtc_pred_train))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_clntxt,dtc_pred_train, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_DT_test)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_clntxt,dtc_pred_test))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_clntxt,dtc_pred_test, average='weighted'))

--------------------------------------


Train Conf Matrix : 
 [[ 771  438]
 [  56 1135]]

Train DATA ACCURACY 0.7941666666666667

Train data f1-score for class '1' 0.7890808038519386


--------------------------------------


TEST Conf Matrix : 
 [[156 135]
 [ 42 267]]

TEST DATA ACCURACY 0.705

Test data f1-score for class '1' 0.6962411017058838
In [136]:
dtc_pred_unseen = Dtc_val.predict(TV_Mat_unseen)

10D. Random Forest Model

In [132]:
from sklearn.ensemble import RandomForestClassifier
Rf = RandomForestClassifier(n_estimators=500, max_depth=12, criterion='entropy')

Rf.fit(X_train_TV_clntxt,Y_train_TV_clntxt)
Out[132]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=12, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [133]:
rf_pred_train = Rf.predict(X_train_TV_clntxt)
rf_pred_test = Rf.predict(X_test_TV_clntxt)


confusion_matrix_RF_train =confusion_matrix(Y_train_TV_clntxt, rf_pred_train)

confusion_matrix_RF_test= confusion_matrix(Y_test_TV_clntxt, rf_pred_test)
In [134]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("Train Conf Matrix : \n", confusion_matrix_RF_train)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,rf_pred_train))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_clntxt,rf_pred_train, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_RF_test)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_clntxt,rf_pred_test))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_clntxt,rf_pred_test, average='weighted'))

--------------------------------------


Train Conf Matrix : 
 [[1102  107]
 [ 259  932]]

Train DATA ACCURACY 0.8475

Train data f1-score for class '1' 0.8468124182094188


--------------------------------------


TEST Conf Matrix : 
 [[234  57]
 [ 93 216]]

TEST DATA ACCURACY 0.75

Test data f1-score for class '1' 0.7495495946351717
In [76]:
from sklearn.ensemble import RandomForestClassifier
Rf = RandomForestClassifier(n_estimators=500, max_depth=12, criterion='gini')

Rf.fit(X_train_TV_nsw,Y_train_TV_nsw)
Out[76]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=12, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [117]:
rf_pred_train = Rf.predict(X_train_TV_nsw)
rf_pred_test = Rf.predict(X_test_TV_nsw)


confusion_matrix_RF_train =confusion_matrix(Y_train_TV_nsw, rf_pred_train)

confusion_matrix_RF_test= confusion_matrix(Y_test_TV_nsw, rf_pred_test)
In [119]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("Train Conf Matrix : \n", confusion_matrix_RF_train)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_nsw,rf_pred_train))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_nsw,rf_pred_train, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_RF_test)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_nsw,rf_pred_test))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_nsw,rf_pred_test, average='weighted'))

--------------------------------------


Train Conf Matrix : 
 [[1126   81]
 [ 263  930]]

Train DATA ACCURACY 0.8566666666666667

Train data f1-score for class '1' 0.8557730353459228


--------------------------------------


TEST Conf Matrix : 
 [[222  71]
 [ 83 224]]

TEST DATA ACCURACY 0.7433333333333333

Test data f1-score for class '1' 0.7433504446345701
In [135]:
rf_pred_unseen = Rf.predict(TV_Mat_unseen)

10E. SVM Model

In [76]:
## Build a SVM Classifier
from sklearn.svm import SVC, LinearSVC
In [101]:
## Create an SVC object and print it to see the default arguments
svc = LinearSVC()
svc
Out[101]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
In [102]:
## Fit
svc_tv = svc.fit(X_train_TV_clntxt, Y_train_TV_clntxt)
In [103]:
# Predictions on test data
svm_pred_tv_train=svc_tv.predict(X_train_TV_clntxt)
svm_pred_tv_test=svc_tv.predict(X_test_TV_clntxt)
confusion_matrix_train_svm= confusion_matrix(Y_train_TV_clntxt,svm_pred_tv_train)
confusion_matrix_test_svm= confusion_matrix(Y_test_TV_clntxt,svm_pred_tv_test)
In [104]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_train_svm)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,svm_pred_tv_train))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_clntxt,svm_pred_tv_train, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_test_svm)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_clntxt,svm_pred_tv))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_clntxt,svm_pred_tv, average='weighted'))

--------------------------------------


TEST Conf Matrix : 
 [[1179   30]
 [  24 1167]]

Train DATA ACCURACY 0.9775

Train data f1-score for class '1' 0.9775002812570316


--------------------------------------


TEST Conf Matrix : 
 [[229  62]
 [ 58 251]]

TEST DATA ACCURACY 0.485
C:\Users\Lokesh\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning:

F-score is ill-defined and being set to 0.0 in labels with no predicted samples.

Test data f1-score for class '1' 0.3168013468013468

10F. Gradient Boosting Ensemble Model

In [ ]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
In [36]:
from sklearn.ensemble import GradientBoostingClassifier

GBM = GradientBoostingClassifier(learning_rate=0.12, max_depth=3, n_estimators=200, max_features=0.2, subsample=0.8)

%time GBM.fit(X_train_TV_clntxt,Y_train_TV_clntxt)
Wall time: 25 s
Out[36]:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.12, loss='deviance', max_depth=3,
                           max_features=0.2, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=200,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=0.8, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
In [37]:
GBM_pred_train = GBM.predict(X_train_TV_clntxt)
GBM_pred_test = GBM.predict(X_test_TV_clntxt)


confusion_matrix_GBM_train =confusion_matrix(Y_train_TV_clntxt, GBM_pred_train)

confusion_matrix_GBM_test= confusion_matrix(Y_test_TV_clntxt, GBM_pred_test)
In [38]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("Train Conf Matrix : \n", confusion_matrix_GBM_train)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,GBM_pred_train))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_clntxt,GBM_pred_train, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_GBM_test)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_clntxt,GBM_pred_test))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_clntxt,GBM_pred_test, average='weighted'))

--------------------------------------


Train Conf Matrix : 
 [[1093  116]
 [  71 1120]]

Train DATA ACCURACY 0.9220833333333334

Train data f1-score for class '1' 0.922066895706071


--------------------------------------


TEST Conf Matrix : 
 [[212  79]
 [ 50 259]]

TEST DATA ACCURACY 0.785

Test data f1-score for class '1' 0.7841809603930786
In [239]:
# predict probabilities
probs = logreg.predict_proba(X_test_TV_clntxt)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# predict class values
yhat = logreg.predict(X_test_TV_clntxt)
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(Y_test_TV_clntxt, probs, pos_label='1')
# calculate F1 score
f1 = f1_score(Y_test_TV_clntxt, yhat, average='weighted')
In [240]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# calculate roc curve
fpr, tpr, thresholds = roc_curve(Y_test_TV_clntxt, probs, pos_label='1')
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the precision-recall curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
Out[240]:
[<matplotlib.lines.Line2D at 0x23749c50>]
Out[240]:
[<matplotlib.lines.Line2D at 0x23b08278>]
In [118]:
#Converting to dictory 
Unseen_test = Unseen_test.reset_index().to_dict(orient='list')
In [119]:
## We are trying to create four seperate lists for text with stop words, text without stop words,
## text with stemmed words and text with lemmatized words.

## Naming Conventions followed ####

## 'clean' word is appended to lists which do not contain stopwords

## 'all' keyword is appended to lists which contain stopwords.

## use extend so it's a big flat list of vocab

Unseen_test['clean_text_stemmed'] = []
Unseen_test['clean_text_lemmatized'] = []
Unseen_test['text_stemmed'] = []
Unseen_test['text_lemmatized'] = []

vocab_stemmed = []

vocab_tokenized = []
allvocab_tokenized = []

vocab_lemmatized = []
allvocab_lemmatized = []


for idx,text in enumerate(Unseen_test['review']):

## first convert the entire text into spacy document type
#     print(f"The type of text is {type(text)} and text is {text}")
#     print(f"The type of idx is {type(idx)} and idx is {idx}")
    doc = nlp(text)
    print(f"processing {idx} document")
    words_stemmed = tokenize_and_stem(doc)
    words_lemmatized = tokenize_and_lemmatize(doc)
    vocab_stemmed.extend(words_stemmed)
    vocab_lemmatized.extend(words_lemmatized)
    
    Unseen_test['clean_text_stemmed'].append(words_stemmed)
    Unseen_test['clean_text_lemmatized'].append(words_lemmatized)
    
    allwords_stemmed = tokenize_and_stem(doc, False) 
    allwords_lemmatized = tokenize_and_lemmatize(doc, False)
    allvocab_lemmatized.extend(allwords_lemmatized)
    
    Unseen_test['text_stemmed'].append(allwords_stemmed)
    Unseen_test['text_lemmatized'].append(allwords_lemmatized)
    
    allwords_tokenized = tokenize_only(doc,False)
    allvocab_tokenized.extend(allwords_tokenized)
    
    words_tokenized = tokenize_only(doc)
    vocab_tokenized.extend(words_tokenized)
processing 0 document
processing 1 document
processing 2 document
processing 3 document
processing 4 document
processing 5 document
processing 6 document
processing 7 document
processing 8 document
processing 9 document
processing 10 document
processing 11 document
processing 12 document
processing 13 document
processing 14 document
processing 15 document
processing 16 document
processing 17 document
processing 18 document
processing 19 document
processing 20 document
processing 21 document
processing 22 document
processing 23 document
processing 24 document
processing 25 document
processing 26 document
processing 27 document
processing 28 document
processing 29 document
processing 30 document
processing 31 document
processing 32 document
processing 33 document
processing 34 document
processing 35 document
processing 36 document
processing 37 document
processing 38 document
processing 39 document
processing 40 document
processing 41 document
processing 42 document
processing 43 document
processing 44 document
processing 45 document
processing 46 document
processing 47 document
processing 48 document
processing 49 document
processing 50 document
processing 51 document
processing 52 document
processing 53 document
processing 54 document
processing 55 document
processing 56 document
processing 57 document
processing 58 document
processing 59 document
processing 60 document
processing 61 document
processing 62 document
processing 63 document
processing 64 document
processing 65 document
processing 66 document
processing 67 document
processing 68 document
processing 69 document
processing 70 document
processing 71 document
processing 72 document
processing 73 document
processing 74 document
processing 75 document
processing 76 document
processing 77 document
processing 78 document
processing 79 document
processing 80 document
processing 81 document
processing 82 document
processing 83 document
processing 84 document
processing 85 document
processing 86 document
processing 87 document
processing 88 document
processing 89 document
processing 90 document
processing 91 document
processing 92 document
processing 93 document
processing 94 document
processing 95 document
processing 96 document
processing 97 document
processing 98 document
processing 99 document
processing 100 document
processing 101 document
processing 102 document
processing 103 document
processing 104 document
processing 105 document
processing 106 document
processing 107 document
processing 108 document
processing 109 document
processing 110 document
processing 111 document
processing 112 document
processing 113 document
processing 114 document
processing 115 document
processing 116 document
processing 117 document
processing 118 document
processing 119 document
processing 120 document
processing 121 document
processing 122 document
processing 123 document
processing 124 document
processing 125 document
processing 126 document
processing 127 document
processing 128 document
processing 129 document
processing 130 document
processing 131 document
processing 132 document
processing 133 document
processing 134 document
processing 135 document
processing 136 document
processing 137 document
processing 138 document
processing 139 document
processing 140 document
processing 141 document
processing 142 document
processing 143 document
processing 144 document
processing 145 document
processing 146 document
processing 147 document
processing 148 document
processing 149 document
processing 150 document
processing 151 document
processing 152 document
processing 153 document
processing 154 document
processing 155 document
processing 156 document
processing 157 document
processing 158 document
processing 159 document
processing 160 document
processing 161 document
processing 162 document
processing 163 document
processing 164 document
processing 165 document
processing 166 document
processing 167 document
processing 168 document
processing 169 document
processing 170 document
processing 171 document
processing 172 document
processing 173 document
processing 174 document
processing 175 document
processing 176 document
processing 177 document
processing 178 document
processing 179 document
processing 180 document
processing 181 document
processing 182 document
processing 183 document
processing 184 document
processing 185 document
processing 186 document
processing 187 document
processing 188 document
processing 189 document
processing 190 document
processing 191 document
processing 192 document
processing 193 document
processing 194 document
processing 195 document
processing 196 document
processing 197 document
processing 198 document
processing 199 document
processing 200 document
processing 201 document
processing 202 document
processing 203 document
processing 204 document
processing 205 document
processing 206 document
processing 207 document
processing 208 document
processing 209 document
processing 210 document
processing 211 document
processing 212 document
processing 213 document
processing 214 document
processing 215 document
processing 216 document
processing 217 document
processing 218 document
processing 219 document
processing 220 document
processing 221 document
processing 222 document
processing 223 document
processing 224 document
processing 225 document
processing 226 document
processing 227 document
processing 228 document
processing 229 document
processing 230 document
processing 231 document
processing 232 document
processing 233 document
processing 234 document
processing 235 document
processing 236 document
processing 237 document
processing 238 document
processing 239 document
processing 240 document
processing 241 document
processing 242 document
processing 243 document
processing 244 document
processing 245 document
processing 246 document
processing 247 document
processing 248 document
processing 249 document
processing 250 document
processing 251 document
processing 252 document
processing 253 document
processing 254 document
processing 255 document
processing 256 document
processing 257 document
processing 258 document
processing 259 document
processing 260 document
processing 261 document
processing 262 document
processing 263 document
processing 264 document
processing 265 document
processing 266 document
processing 267 document
processing 268 document
processing 269 document
processing 270 document
processing 271 document
processing 272 document
processing 273 document
processing 274 document
processing 275 document
processing 276 document
processing 277 document
processing 278 document
processing 279 document
processing 280 document
processing 281 document
processing 282 document
processing 283 document
processing 284 document
processing 285 document
processing 286 document
processing 287 document
processing 288 document
processing 289 document
processing 290 document
processing 291 document
processing 292 document
processing 293 document
processing 294 document
processing 295 document
processing 296 document
processing 297 document
processing 298 document
processing 299 document
processing 300 document
processing 301 document
processing 302 document
processing 303 document
processing 304 document
processing 305 document
processing 306 document
processing 307 document
processing 308 document
processing 309 document
processing 310 document
processing 311 document
processing 312 document
processing 313 document
processing 314 document
processing 315 document
processing 316 document
processing 317 document
processing 318 document
processing 319 document
processing 320 document
processing 321 document
processing 322 document
processing 323 document
processing 324 document
processing 325 document
processing 326 document
processing 327 document
processing 328 document
processing 329 document
processing 330 document
processing 331 document
processing 332 document
processing 333 document
processing 334 document
processing 335 document
processing 336 document
processing 337 document
processing 338 document
processing 339 document
processing 340 document
processing 341 document
processing 342 document
processing 343 document
processing 344 document
processing 345 document
processing 346 document
processing 347 document
processing 348 document
processing 349 document
processing 350 document
processing 351 document
processing 352 document
processing 353 document
processing 354 document
processing 355 document
processing 356 document
processing 357 document
processing 358 document
processing 359 document
processing 360 document
processing 361 document
processing 362 document
processing 363 document
processing 364 document
processing 365 document
processing 366 document
processing 367 document
processing 368 document
processing 369 document
processing 370 document
processing 371 document
processing 372 document
processing 373 document
processing 374 document
processing 375 document
processing 376 document
processing 377 document
processing 378 document
processing 379 document
processing 380 document
processing 381 document
processing 382 document
processing 383 document
processing 384 document
processing 385 document
processing 386 document
processing 387 document
processing 388 document
processing 389 document
processing 390 document
processing 391 document
processing 392 document
processing 393 document
processing 394 document
processing 395 document
processing 396 document
processing 397 document
processing 398 document
processing 399 document
processing 400 document
processing 401 document
processing 402 document
processing 403 document
processing 404 document
processing 405 document
processing 406 document
processing 407 document
processing 408 document
processing 409 document
processing 410 document
processing 411 document
processing 412 document
processing 413 document
processing 414 document
processing 415 document
processing 416 document
processing 417 document
processing 418 document
processing 419 document
processing 420 document
processing 421 document
processing 422 document
processing 423 document
processing 424 document
processing 425 document
processing 426 document
processing 427 document
processing 428 document
processing 429 document
processing 430 document
processing 431 document
processing 432 document
processing 433 document
processing 434 document
processing 435 document
processing 436 document
processing 437 document
processing 438 document
processing 439 document
processing 440 document
processing 441 document
processing 442 document
processing 443 document
processing 444 document
processing 445 document
processing 446 document
processing 447 document
processing 448 document
processing 449 document
processing 450 document
processing 451 document
processing 452 document
processing 453 document
processing 454 document
processing 455 document
processing 456 document
processing 457 document
processing 458 document
processing 459 document
processing 460 document
processing 461 document
processing 462 document
processing 463 document
processing 464 document
processing 465 document
processing 466 document
processing 467 document
processing 468 document
processing 469 document
processing 470 document
processing 471 document
processing 472 document
processing 473 document
processing 474 document
processing 475 document
processing 476 document
processing 477 document
processing 478 document
processing 479 document
processing 480 document
processing 481 document
processing 482 document
processing 483 document
processing 484 document
processing 485 document
processing 486 document
processing 487 document
processing 488 document
processing 489 document
processing 490 document
processing 491 document
processing 492 document
processing 493 document
processing 494 document
processing 495 document
processing 496 document
processing 497 document
processing 498 document
processing 499 document
processing 500 document
processing 501 document
processing 502 document
processing 503 document
processing 504 document
processing 505 document
processing 506 document
processing 507 document
processing 508 document
processing 509 document
processing 510 document
processing 511 document
processing 512 document
processing 513 document
processing 514 document
processing 515 document
processing 516 document
processing 517 document
processing 518 document
processing 519 document
processing 520 document
processing 521 document
processing 522 document
processing 523 document
processing 524 document
processing 525 document
processing 526 document
processing 527 document
processing 528 document
processing 529 document
processing 530 document
processing 531 document
processing 532 document
processing 533 document
processing 534 document
processing 535 document
processing 536 document
processing 537 document
processing 538 document
processing 539 document
processing 540 document
processing 541 document
processing 542 document
processing 543 document
processing 544 document
processing 545 document
processing 546 document
processing 547 document
processing 548 document
processing 549 document
processing 550 document
processing 551 document
processing 552 document
processing 553 document
processing 554 document
processing 555 document
processing 556 document
processing 557 document
processing 558 document
processing 559 document
processing 560 document
processing 561 document
processing 562 document
processing 563 document
processing 564 document
processing 565 document
processing 566 document
processing 567 document
processing 568 document
processing 569 document
processing 570 document
processing 571 document
processing 572 document
processing 573 document
processing 574 document
processing 575 document
processing 576 document
processing 577 document
processing 578 document
processing 579 document
processing 580 document
processing 581 document
processing 582 document
processing 583 document
processing 584 document
processing 585 document
processing 586 document
processing 587 document
processing 588 document
processing 589 document
processing 590 document
processing 591 document
processing 592 document
processing 593 document
processing 594 document
processing 595 document
processing 596 document
processing 597 document
processing 598 document
processing 599 document
processing 600 document
processing 601 document
processing 602 document
processing 603 document
processing 604 document
processing 605 document
processing 606 document
processing 607 document
processing 608 document
processing 609 document
processing 610 document
processing 611 document
processing 612 document
processing 613 document
processing 614 document
processing 615 document
processing 616 document
processing 617 document
processing 618 document
processing 619 document
processing 620 document
processing 621 document
processing 622 document
processing 623 document
processing 624 document
processing 625 document
processing 626 document
processing 627 document
processing 628 document
processing 629 document
processing 630 document
processing 631 document
processing 632 document
processing 633 document
processing 634 document
processing 635 document
processing 636 document
processing 637 document
processing 638 document
processing 639 document
processing 640 document
processing 641 document
processing 642 document
processing 643 document
processing 644 document
processing 645 document
processing 646 document
processing 647 document
processing 648 document
processing 649 document
processing 650 document
processing 651 document
processing 652 document
processing 653 document
processing 654 document
processing 655 document
processing 656 document
processing 657 document
processing 658 document
processing 659 document
processing 660 document
processing 661 document
processing 662 document
processing 663 document
processing 664 document
processing 665 document
processing 666 document
processing 667 document
processing 668 document
processing 669 document
processing 670 document
processing 671 document
processing 672 document
processing 673 document
processing 674 document
processing 675 document
processing 676 document
processing 677 document
processing 678 document
processing 679 document
processing 680 document
processing 681 document
processing 682 document
processing 683 document
processing 684 document
processing 685 document
processing 686 document
processing 687 document
processing 688 document
processing 689 document
processing 690 document
processing 691 document
processing 692 document
processing 693 document
processing 694 document
processing 695 document
processing 696 document
processing 697 document
processing 698 document
processing 699 document
processing 700 document
processing 701 document
processing 702 document
processing 703 document
processing 704 document
processing 705 document
processing 706 document
processing 707 document
processing 708 document
processing 709 document
processing 710 document
processing 711 document
processing 712 document
processing 713 document
processing 714 document
processing 715 document
processing 716 document
processing 717 document
processing 718 document
processing 719 document
processing 720 document
processing 721 document
processing 722 document
processing 723 document
processing 724 document
processing 725 document
processing 726 document
processing 727 document
processing 728 document
processing 729 document
processing 730 document
processing 731 document
processing 732 document
processing 733 document
processing 734 document
processing 735 document
processing 736 document
processing 737 document
processing 738 document
processing 739 document
processing 740 document
processing 741 document
processing 742 document
processing 743 document
processing 744 document
processing 745 document
processing 746 document
processing 747 document
processing 748 document
processing 749 document
processing 750 document
processing 751 document
processing 752 document
processing 753 document
processing 754 document
processing 755 document
processing 756 document
processing 757 document
processing 758 document
processing 759 document
processing 760 document
processing 761 document
processing 762 document
processing 763 document
processing 764 document
processing 765 document
processing 766 document
processing 767 document
processing 768 document
processing 769 document
processing 770 document
processing 771 document
processing 772 document
processing 773 document
processing 774 document
processing 775 document
processing 776 document
processing 777 document
processing 778 document
processing 779 document
processing 780 document
processing 781 document
processing 782 document
processing 783 document
processing 784 document
processing 785 document
processing 786 document
processing 787 document
processing 788 document
processing 789 document
processing 790 document
processing 791 document
processing 792 document
processing 793 document
processing 794 document
processing 795 document
processing 796 document
processing 797 document
processing 798 document
processing 799 document
processing 800 document
processing 801 document
processing 802 document
processing 803 document
processing 804 document
processing 805 document
processing 806 document
processing 807 document
processing 808 document
processing 809 document
processing 810 document
processing 811 document
processing 812 document
processing 813 document
processing 814 document
processing 815 document
processing 816 document
processing 817 document
processing 818 document
processing 819 document
processing 820 document
processing 821 document
processing 822 document
processing 823 document
processing 824 document
processing 825 document
processing 826 document
processing 827 document
processing 828 document
processing 829 document
processing 830 document
processing 831 document
processing 832 document
processing 833 document
processing 834 document
processing 835 document
processing 836 document
processing 837 document
processing 838 document
processing 839 document
processing 840 document
processing 841 document
processing 842 document
processing 843 document
processing 844 document
processing 845 document
processing 846 document
processing 847 document
processing 848 document
processing 849 document
processing 850 document
processing 851 document
processing 852 document
processing 853 document
processing 854 document
processing 855 document
processing 856 document
processing 857 document
processing 858 document
processing 859 document
processing 860 document
processing 861 document
processing 862 document
processing 863 document
processing 864 document
processing 865 document
processing 866 document
processing 867 document
processing 868 document
processing 869 document
processing 870 document
processing 871 document
processing 872 document
processing 873 document
processing 874 document
processing 875 document
processing 876 document
processing 877 document
processing 878 document
processing 879 document
processing 880 document
processing 881 document
processing 882 document
processing 883 document
processing 884 document
processing 885 document
processing 886 document
processing 887 document
processing 888 document
processing 889 document
processing 890 document
processing 891 document
processing 892 document
processing 893 document
processing 894 document
processing 895 document
processing 896 document
processing 897 document
processing 898 document
processing 899 document
processing 900 document
processing 901 document
processing 902 document
processing 903 document
processing 904 document
processing 905 document
processing 906 document
processing 907 document
processing 908 document
processing 909 document
processing 910 document
processing 911 document
processing 912 document
processing 913 document
processing 914 document
processing 915 document
processing 916 document
processing 917 document
processing 918 document
processing 919 document
processing 920 document
processing 921 document
processing 922 document
processing 923 document
processing 924 document
processing 925 document
processing 926 document
processing 927 document
processing 928 document
processing 929 document
processing 930 document
processing 931 document
processing 932 document
processing 933 document
processing 934 document
processing 935 document
processing 936 document
processing 937 document
processing 938 document
processing 939 document
processing 940 document
processing 941 document
processing 942 document
processing 943 document
processing 944 document
processing 945 document
processing 946 document
processing 947 document
processing 948 document
processing 949 document
processing 950 document
processing 951 document
processing 952 document
processing 953 document
processing 954 document
processing 955 document
processing 956 document
processing 957 document
processing 958 document
processing 959 document
processing 960 document
processing 961 document
processing 962 document
processing 963 document
processing 964 document
processing 965 document
processing 966 document
processing 967 document
processing 968 document
processing 969 document
processing 970 document
processing 971 document
processing 972 document
processing 973 document
processing 974 document
processing 975 document
processing 976 document
processing 977 document
processing 978 document
processing 979 document
processing 980 document
processing 981 document
processing 982 document
processing 983 document
processing 984 document
processing 985 document
processing 986 document
processing 987 document
processing 988 document
processing 989 document
processing 990 document
processing 991 document
processing 992 document
processing 993 document
processing 994 document
processing 995 document
processing 996 document
processing 997 document
processing 998 document
processing 999 document
processing 1000 document
processing 1001 document
processing 1002 document
processing 1003 document
processing 1004 document
processing 1005 document
processing 1006 document
processing 1007 document
processing 1008 document
processing 1009 document
processing 1010 document
processing 1011 document
processing 1012 document
processing 1013 document
processing 1014 document
processing 1015 document
processing 1016 document
processing 1017 document
processing 1018 document
processing 1019 document
processing 1020 document
processing 1021 document
processing 1022 document
processing 1023 document
processing 1024 document
processing 1025 document
processing 1026 document
processing 1027 document
processing 1028 document
processing 1029 document
processing 1030 document
processing 1031 document
processing 1032 document
processing 1033 document
processing 1034 document
processing 1035 document
processing 1036 document
processing 1037 document
processing 1038 document
processing 1039 document
processing 1040 document
processing 1041 document
processing 1042 document
processing 1043 document
processing 1044 document
processing 1045 document
processing 1046 document
processing 1047 document
processing 1048 document
processing 1049 document
processing 1050 document
processing 1051 document
processing 1052 document
processing 1053 document
processing 1054 document
processing 1055 document
processing 1056 document
processing 1057 document
processing 1058 document
processing 1059 document
processing 1060 document
processing 1061 document
processing 1062 document
processing 1063 document
processing 1064 document
processing 1065 document
processing 1066 document
processing 1067 document
processing 1068 document
processing 1069 document
processing 1070 document
processing 1071 document
processing 1072 document
processing 1073 document
processing 1074 document
processing 1075 document
processing 1076 document
processing 1077 document
processing 1078 document
processing 1079 document
processing 1080 document
processing 1081 document
processing 1082 document
processing 1083 document
processing 1084 document
processing 1085 document
processing 1086 document
processing 1087 document
processing 1088 document
processing 1089 document
processing 1090 document
processing 1091 document
processing 1092 document
processing 1093 document
processing 1094 document
processing 1095 document
processing 1096 document
processing 1097 document
processing 1098 document
processing 1099 document
processing 1100 document
processing 1101 document
processing 1102 document
processing 1103 document
processing 1104 document
processing 1105 document
processing 1106 document
processing 1107 document
processing 1108 document
processing 1109 document
processing 1110 document
processing 1111 document
processing 1112 document
processing 1113 document
processing 1114 document
processing 1115 document
processing 1116 document
processing 1117 document
processing 1118 document
processing 1119 document
processing 1120 document
processing 1121 document
processing 1122 document
processing 1123 document
processing 1124 document
processing 1125 document
processing 1126 document
processing 1127 document
processing 1128 document
processing 1129 document
processing 1130 document
processing 1131 document
processing 1132 document
processing 1133 document
processing 1134 document
processing 1135 document
processing 1136 document
processing 1137 document
processing 1138 document
processing 1139 document
processing 1140 document
processing 1141 document
processing 1142 document
processing 1143 document
processing 1144 document
processing 1145 document
processing 1146 document
processing 1147 document
processing 1148 document
processing 1149 document
processing 1150 document
processing 1151 document
processing 1152 document
processing 1153 document
processing 1154 document
processing 1155 document
processing 1156 document
processing 1157 document
processing 1158 document
processing 1159 document
processing 1160 document
processing 1161 document
processing 1162 document
processing 1163 document
processing 1164 document
processing 1165 document
processing 1166 document
processing 1167 document
processing 1168 document
processing 1169 document
processing 1170 document
processing 1171 document
processing 1172 document
processing 1173 document
processing 1174 document
processing 1175 document
processing 1176 document
processing 1177 document
processing 1178 document
processing 1179 document
processing 1180 document
processing 1181 document
processing 1182 document
processing 1183 document
processing 1184 document
processing 1185 document
processing 1186 document
processing 1187 document
processing 1188 document
processing 1189 document
processing 1190 document
processing 1191 document
processing 1192 document
processing 1193 document
processing 1194 document
processing 1195 document
processing 1196 document
processing 1197 document
processing 1198 document
processing 1199 document
In [120]:
## tfidf vectorizer needs sentence and not token. Hence we need to combine all the tokens back to form a string

Unseen_test['clean_text_stemmed'] = [' '.join(text) for text in Unseen_test['clean_text_stemmed']]
Unseen_test['clean_text_lemmatized'] = [' '.join(text) for text in Unseen_test['clean_text_lemmatized']]
In [121]:
# #define vectorizer parameters
# tfidf_vectorizer = TfidfVectorizer(max_df=0.95, max_features=5000,
#                                  min_df=0.001,
#                                  use_idf=True, ngram_range=(1,5))


tfidf_matrix_unseen = tfidf_vectorizer.transform(Unseen_test['clean_text_lemmatized'])

print(tfidf_matrix_unseen.shape)
(1200, 3525)
In [122]:
TV_Matrix_unseen = tfidf_matrix_unseen.todense()
TV_Matrix_unseen
Out[122]:
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
In [123]:
#For TFid vector
TV_Mat_unseen = pd.DataFrame(TV_Matrix_unseen)
TV_Mat_unseen.head()
Out[123]:
0 1 2 3 4 5 6 7 8 9 ... 3515 3516 3517 3518 3519 3520 3521 3522 3523 3524
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.250207 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 3525 columns

In [128]:
GBM_pred_unseen = GBM.predict(TV_Mat_unseen)
In [40]:
GBM_pred_unseen = pd.DataFrame(GBM_pred_unseen)
In [43]:
GBM_pred_unseen.head(3)
Out[43]:
0
0 0
1 0
2 0
In [44]:
#Exporting LSTM model output to CSV file. 
GBM_pred_unseen.to_csv("GBM_output.csv")

Experiment with Hyperparameters

In [127]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\Hyperparametertuning.png")
Out[127]:

Idea Generation from HyperParameter experiement

To avoid overfitting - You can tune with below ideas.

Max_features should be low - Typically 0.2-0.3

Subsampling should be high - Typically 0.6-0.8

N-estimators should be low - Typically 100-300

Max_depth should be low - Typically 3-8

Learning rate should be low - Typically 0.1 - 0.01

To increase Training accuracy - You can tune with below ideas.

Max_features should be high - Typically 0.4-0.6

Subsampling should be low - Typically 0.3-0.5

N-estimators should be low - Typically 500-1000

Max_depth should be low - Typically 10-20

Learning rate should be low - Typically 0.1 - 1

10G. XG Boost Model classifier

In [ ]:
!pip install xgboost
In [51]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
In [52]:
eval_set = [(X_train_TV_clntxt, Y_train_TV_clntxt), (X_test_TV_clntxt, Y_test_TV_clntxt)]

# fit model on training data
XG = XGBClassifier(max_depth=8,learning_rate=0.05, n_estimators=100, subsample=0.8, reg_alpha=0.6, reg_lambda=0.6, gamma=10, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
%time XG.fit(X_train_TV_clntxt,Y_train_TV_clntxt)
Wall time: 1min 25s
Out[52]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, early_stopping_rounds=10,
              eval_metric='logloss',
              eval_set=[(      0     1     2     3     4     5     6         7     8        9     ...  \
1936   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000   0.0  0.00000  ...   
85     0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000   0.0  0.00000  ...   
2045   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000   0.0  0.00000  ...   
1230   0.0   0.0   0.0   0.0   0.0   0.0   0.0  0.000000   0.0  0....
1194    0
       ..
2041    1
664     0
1318    0
723     0
2863    1
Name: Sentiment, Length: 600, dtype: object)],
              gamma=10, learning_rate=0.05, max_delta_step=0, max_depth=8,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0.6, reg_lambda=0.6, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.8, verbose=True, verbosity=1)
In [53]:
XG_pred_train = XG.predict(X_train_TV_clntxt)
XG_pred_test = XG.predict(X_test_TV_clntxt)


confusion_matrix_XG_train =confusion_matrix(Y_train_TV_clntxt, XG_pred_train)

confusion_matrix_XG_test= confusion_matrix(Y_test_TV_clntxt, XG_pred_test)
In [54]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")
confusion_matrix_XG_train
print("Train Conf Matrix : \n", confusion_matrix_XG_train)
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,XG_pred_train))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_clntxt,XG_pred_train, average='weighted'))


### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix_XG_test)
print("\nTEST DATA ACCURACY",accuracy_score(Y_test_TV_clntxt,XG_pred_test))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_clntxt,XG_pred_test, average='weighted'))

--------------------------------------


Out[54]:
array([[ 886,  323],
       [ 148, 1043]], dtype=int64)
Train Conf Matrix : 
 [[ 886  323]
 [ 148 1043]]

Train DATA ACCURACY 0.80375

Train data f1-score for class '1' 0.8028098711831237


--------------------------------------


TEST Conf Matrix : 
 [[190 101]
 [ 39 270]]

TEST DATA ACCURACY 0.7666666666666667

Test data f1-score for class '1' 0.763393665158371
In [77]:
XG_pred_train
Out[77]:
array(['1', '0', '1', ..., '1', '0', '1'], dtype=object)

10H. CNN model building

In [102]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, Conv1D, MaxPooling1D
In [81]:
input_shape = TV_Mat_clntxt.shape
In [103]:
modelcnn= Sequential()
modelcnn.add(Embedding(20000, 100, input_length=500))
modelcnn.add(Dropout(0.1))
modelcnn.add(Conv1D(128, 5, activation='relu'))
modelcnn.add(MaxPooling1D(pool_size=4))
modelcnn.add(LSTM(100))
modelcnn.add(Dense(2, activation='softmax'))
modelcnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
W0829 21:40:14.629139 12536 deprecation_wrapper.py:119] From C:\Users\Lokesh\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3976: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.

W0829 21:40:50.947216 12536 deprecation_wrapper.py:119] From C:\Users\Lokesh\Anaconda3\lib\site-packages\keras\optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0829 21:40:51.024220 12536 deprecation_wrapper.py:119] From C:\Users\Lokesh\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3376: The name tf.log is deprecated. Please use tf.math.log instead.

W0829 21:40:51.039221 12536 deprecation.py:323] From C:\Users\Lokesh\Anaconda3\lib\site-packages\tensorflow\python\ops\nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
In [104]:
modelcnn.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 500, 100)          2000000   
_________________________________________________________________
dropout_3 (Dropout)          (None, 500, 100)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 496, 128)          64128     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 124, 128)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               91600     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 202       
=================================================================
Total params: 2,155,930
Trainable params: 2,155,930
Non-trainable params: 0
_________________________________________________________________
In [108]:
hist = modelcnn.fit(X_train, Y_train, batch_size=32, epochs=5, verbose=1, validation_data=(X_test, Y_test))
Train on 2400 samples, validate on 600 samples
Epoch 1/5
2400/2400 [==============================] - ETA: 7:28 - loss: 0.6965 - acc: 0.375 - ETA: 3:59 - loss: 0.6849 - acc: 0.531 - ETA: 2:47 - loss: 0.7030 - acc: 0.479 - ETA: 2:10 - loss: 0.6941 - acc: 0.523 - ETA: 1:47 - loss: 0.6951 - acc: 0.518 - ETA: 1:32 - loss: 0.6972 - acc: 0.510 - ETA: 1:21 - loss: 0.7013 - acc: 0.486 - ETA: 1:13 - loss: 0.6993 - acc: 0.492 - ETA: 1:06 - loss: 0.6967 - acc: 0.500 - ETA: 1:01 - loss: 0.6939 - acc: 0.509 - ETA: 57s - loss: 0.6931 - acc: 0.508 - ETA: 54s - loss: 0.6923 - acc: 0.50 - ETA: 51s - loss: 0.6905 - acc: 0.51 - ETA: 48s - loss: 0.6891 - acc: 0.52 - ETA: 46s - loss: 0.6883 - acc: 0.52 - ETA: 44s - loss: 0.6878 - acc: 0.53 - ETA: 42s - loss: 0.6862 - acc: 0.54 - ETA: 40s - loss: 0.6857 - acc: 0.54 - ETA: 39s - loss: 0.6839 - acc: 0.55 - ETA: 37s - loss: 0.6849 - acc: 0.54 - ETA: 36s - loss: 0.6839 - acc: 0.55 - ETA: 35s - loss: 0.6814 - acc: 0.56 - ETA: 34s - loss: 0.6773 - acc: 0.57 - ETA: 32s - loss: 0.6751 - acc: 0.57 - ETA: 31s - loss: 0.6787 - acc: 0.57 - ETA: 30s - loss: 0.6757 - acc: 0.57 - ETA: 29s - loss: 0.6747 - acc: 0.57 - ETA: 29s - loss: 0.6762 - acc: 0.57 - ETA: 28s - loss: 0.6741 - acc: 0.57 - ETA: 27s - loss: 0.6733 - acc: 0.57 - ETA: 26s - loss: 0.6714 - acc: 0.57 - ETA: 25s - loss: 0.6711 - acc: 0.57 - ETA: 24s - loss: 0.6711 - acc: 0.57 - ETA: 23s - loss: 0.6714 - acc: 0.58 - ETA: 23s - loss: 0.6712 - acc: 0.58 - ETA: 22s - loss: 0.6687 - acc: 0.58 - ETA: 21s - loss: 0.6683 - acc: 0.58 - ETA: 20s - loss: 0.6681 - acc: 0.58 - ETA: 20s - loss: 0.6690 - acc: 0.58 - ETA: 19s - loss: 0.6660 - acc: 0.59 - ETA: 18s - loss: 0.6659 - acc: 0.59 - ETA: 17s - loss: 0.6644 - acc: 0.59 - ETA: 17s - loss: 0.6623 - acc: 0.59 - ETA: 16s - loss: 0.6625 - acc: 0.60 - ETA: 16s - loss: 0.6615 - acc: 0.60 - ETA: 15s - loss: 0.6589 - acc: 0.60 - ETA: 14s - loss: 0.6569 - acc: 0.61 - ETA: 14s - loss: 0.6565 - acc: 0.61 - ETA: 13s - loss: 0.6550 - acc: 0.61 - ETA: 12s - loss: 0.6544 - acc: 0.61 - ETA: 12s - loss: 0.6531 - acc: 0.61 - ETA: 11s - loss: 0.6510 - acc: 0.62 - ETA: 11s - loss: 0.6488 - acc: 0.62 - ETA: 10s - loss: 0.6479 - acc: 0.62 - ETA: 10s - loss: 0.6467 - acc: 0.62 - ETA: 9s - loss: 0.6461 - acc: 0.6250 - ETA: 9s - loss: 0.6430 - acc: 0.627 - ETA: 8s - loss: 0.6412 - acc: 0.628 - ETA: 7s - loss: 0.6385 - acc: 0.631 - ETA: 7s - loss: 0.6350 - acc: 0.634 - ETA: 6s - loss: 0.6341 - acc: 0.636 - ETA: 6s - loss: 0.6304 - acc: 0.640 - ETA: 5s - loss: 0.6299 - acc: 0.641 - ETA: 5s - loss: 0.6329 - acc: 0.642 - ETA: 4s - loss: 0.6312 - acc: 0.644 - ETA: 4s - loss: 0.6312 - acc: 0.645 - ETA: 3s - loss: 0.6286 - acc: 0.646 - ETA: 3s - loss: 0.6249 - acc: 0.649 - ETA: 2s - loss: 0.6211 - acc: 0.652 - ETA: 2s - loss: 0.6197 - acc: 0.653 - ETA: 1s - loss: 0.6177 - acc: 0.654 - ETA: 1s - loss: 0.6161 - acc: 0.657 - ETA: 0s - loss: 0.6172 - acc: 0.656 - ETA: 0s - loss: 0.6160 - acc: 0.656 - 37s 15ms/step - loss: 0.6144 - acc: 0.6583 - val_loss: 0.5266 - val_acc: 0.7167
Epoch 2/5
2400/2400 [==============================] - ETA: 27s - loss: 0.4231 - acc: 0.78 - ETA: 26s - loss: 0.3893 - acc: 0.81 - ETA: 26s - loss: 0.3896 - acc: 0.83 - ETA: 26s - loss: 0.3994 - acc: 0.82 - ETA: 26s - loss: 0.3997 - acc: 0.82 - ETA: 25s - loss: 0.4102 - acc: 0.81 - ETA: 25s - loss: 0.4125 - acc: 0.81 - ETA: 24s - loss: 0.4207 - acc: 0.80 - ETA: 24s - loss: 0.4145 - acc: 0.81 - ETA: 23s - loss: 0.4092 - acc: 0.82 - ETA: 23s - loss: 0.3961 - acc: 0.82 - ETA: 23s - loss: 0.3885 - acc: 0.83 - ETA: 23s - loss: 0.3812 - acc: 0.84 - ETA: 23s - loss: 0.3770 - acc: 0.84 - ETA: 22s - loss: 0.3784 - acc: 0.84 - ETA: 22s - loss: 0.3826 - acc: 0.84 - ETA: 22s - loss: 0.3819 - acc: 0.84 - ETA: 21s - loss: 0.3746 - acc: 0.84 - ETA: 21s - loss: 0.3675 - acc: 0.84 - ETA: 21s - loss: 0.3664 - acc: 0.84 - ETA: 20s - loss: 0.3714 - acc: 0.84 - ETA: 20s - loss: 0.3731 - acc: 0.83 - ETA: 19s - loss: 0.3707 - acc: 0.83 - ETA: 19s - loss: 0.3648 - acc: 0.84 - ETA: 19s - loss: 0.3667 - acc: 0.83 - ETA: 18s - loss: 0.3749 - acc: 0.83 - ETA: 18s - loss: 0.3741 - acc: 0.83 - ETA: 17s - loss: 0.3685 - acc: 0.83 - ETA: 17s - loss: 0.3652 - acc: 0.83 - ETA: 17s - loss: 0.3696 - acc: 0.83 - ETA: 16s - loss: 0.3710 - acc: 0.83 - ETA: 16s - loss: 0.3734 - acc: 0.83 - ETA: 15s - loss: 0.3698 - acc: 0.84 - ETA: 15s - loss: 0.3703 - acc: 0.83 - ETA: 15s - loss: 0.3718 - acc: 0.83 - ETA: 14s - loss: 0.3782 - acc: 0.83 - ETA: 14s - loss: 0.3797 - acc: 0.83 - ETA: 13s - loss: 0.3806 - acc: 0.83 - ETA: 13s - loss: 0.3798 - acc: 0.83 - ETA: 13s - loss: 0.3742 - acc: 0.83 - ETA: 12s - loss: 0.3693 - acc: 0.83 - ETA: 12s - loss: 0.3705 - acc: 0.84 - ETA: 12s - loss: 0.3708 - acc: 0.83 - ETA: 11s - loss: 0.3696 - acc: 0.83 - ETA: 11s - loss: 0.3683 - acc: 0.84 - ETA: 10s - loss: 0.3660 - acc: 0.84 - ETA: 10s - loss: 0.3671 - acc: 0.84 - ETA: 10s - loss: 0.3697 - acc: 0.83 - ETA: 9s - loss: 0.3691 - acc: 0.8399 - ETA: 9s - loss: 0.3674 - acc: 0.841 - ETA: 9s - loss: 0.3653 - acc: 0.843 - ETA: 8s - loss: 0.3632 - acc: 0.843 - ETA: 8s - loss: 0.3634 - acc: 0.843 - ETA: 7s - loss: 0.3641 - acc: 0.842 - ETA: 7s - loss: 0.3648 - acc: 0.841 - ETA: 7s - loss: 0.3633 - acc: 0.843 - ETA: 6s - loss: 0.3614 - acc: 0.844 - ETA: 6s - loss: 0.3587 - acc: 0.846 - ETA: 6s - loss: 0.3579 - acc: 0.846 - ETA: 5s - loss: 0.3555 - acc: 0.848 - ETA: 5s - loss: 0.3590 - acc: 0.846 - ETA: 4s - loss: 0.3590 - acc: 0.845 - ETA: 4s - loss: 0.3627 - acc: 0.842 - ETA: 4s - loss: 0.3599 - acc: 0.843 - ETA: 3s - loss: 0.3586 - acc: 0.844 - ETA: 3s - loss: 0.3592 - acc: 0.844 - ETA: 3s - loss: 0.3576 - acc: 0.845 - ETA: 2s - loss: 0.3561 - acc: 0.846 - ETA: 2s - loss: 0.3568 - acc: 0.846 - ETA: 1s - loss: 0.3577 - acc: 0.847 - ETA: 1s - loss: 0.3575 - acc: 0.846 - ETA: 1s - loss: 0.3556 - acc: 0.848 - ETA: 0s - loss: 0.3555 - acc: 0.848 - ETA: 0s - loss: 0.3550 - acc: 0.848 - 30s 12ms/step - loss: 0.3554 - acc: 0.8483 - val_loss: 0.4714 - val_acc: 0.7883
Epoch 3/5
2400/2400 [==============================] - ETA: 25s - loss: 0.2984 - acc: 0.81 - ETA: 24s - loss: 0.1909 - acc: 0.90 - ETA: 24s - loss: 0.1906 - acc: 0.91 - ETA: 23s - loss: 0.2055 - acc: 0.92 - ETA: 23s - loss: 0.1966 - acc: 0.92 - ETA: 23s - loss: 0.2085 - acc: 0.91 - ETA: 22s - loss: 0.2251 - acc: 0.91 - ETA: 22s - loss: 0.2203 - acc: 0.91 - ETA: 22s - loss: 0.2142 - acc: 0.91 - ETA: 21s - loss: 0.2210 - acc: 0.91 - ETA: 22s - loss: 0.2297 - acc: 0.90 - ETA: 22s - loss: 0.2287 - acc: 0.90 - ETA: 21s - loss: 0.2276 - acc: 0.90 - ETA: 21s - loss: 0.2221 - acc: 0.90 - ETA: 21s - loss: 0.2188 - acc: 0.91 - ETA: 21s - loss: 0.2107 - acc: 0.91 - ETA: 20s - loss: 0.2127 - acc: 0.91 - ETA: 20s - loss: 0.2156 - acc: 0.91 - ETA: 20s - loss: 0.2102 - acc: 0.91 - ETA: 19s - loss: 0.2102 - acc: 0.91 - ETA: 19s - loss: 0.2094 - acc: 0.91 - ETA: 19s - loss: 0.2115 - acc: 0.91 - ETA: 19s - loss: 0.2129 - acc: 0.91 - ETA: 18s - loss: 0.2123 - acc: 0.91 - ETA: 18s - loss: 0.2095 - acc: 0.91 - ETA: 18s - loss: 0.2082 - acc: 0.91 - ETA: 17s - loss: 0.2047 - acc: 0.91 - ETA: 17s - loss: 0.2025 - acc: 0.92 - ETA: 16s - loss: 0.1996 - acc: 0.92 - ETA: 16s - loss: 0.2027 - acc: 0.92 - ETA: 15s - loss: 0.2000 - acc: 0.92 - ETA: 15s - loss: 0.2008 - acc: 0.91 - ETA: 15s - loss: 0.2013 - acc: 0.91 - ETA: 14s - loss: 0.2028 - acc: 0.91 - ETA: 14s - loss: 0.2043 - acc: 0.91 - ETA: 14s - loss: 0.2060 - acc: 0.91 - ETA: 13s - loss: 0.2062 - acc: 0.91 - ETA: 13s - loss: 0.2037 - acc: 0.91 - ETA: 12s - loss: 0.2055 - acc: 0.91 - ETA: 12s - loss: 0.2043 - acc: 0.91 - ETA: 12s - loss: 0.2059 - acc: 0.91 - ETA: 11s - loss: 0.2056 - acc: 0.91 - ETA: 11s - loss: 0.2058 - acc: 0.91 - ETA: 11s - loss: 0.2048 - acc: 0.91 - ETA: 10s - loss: 0.2072 - acc: 0.91 - ETA: 10s - loss: 0.2047 - acc: 0.91 - ETA: 9s - loss: 0.2068 - acc: 0.9182 - ETA: 9s - loss: 0.2141 - acc: 0.917 - ETA: 9s - loss: 0.2151 - acc: 0.915 - ETA: 8s - loss: 0.2159 - acc: 0.914 - ETA: 8s - loss: 0.2202 - acc: 0.913 - ETA: 8s - loss: 0.2202 - acc: 0.914 - ETA: 7s - loss: 0.2173 - acc: 0.915 - ETA: 7s - loss: 0.2164 - acc: 0.916 - ETA: 7s - loss: 0.2172 - acc: 0.915 - ETA: 6s - loss: 0.2151 - acc: 0.916 - ETA: 6s - loss: 0.2154 - acc: 0.916 - ETA: 5s - loss: 0.2160 - acc: 0.916 - ETA: 5s - loss: 0.2158 - acc: 0.916 - ETA: 5s - loss: 0.2154 - acc: 0.916 - ETA: 4s - loss: 0.2168 - acc: 0.916 - ETA: 4s - loss: 0.2175 - acc: 0.914 - ETA: 4s - loss: 0.2178 - acc: 0.914 - ETA: 3s - loss: 0.2183 - acc: 0.914 - ETA: 3s - loss: 0.2179 - acc: 0.913 - ETA: 3s - loss: 0.2182 - acc: 0.912 - ETA: 2s - loss: 0.2192 - acc: 0.912 - ETA: 2s - loss: 0.2196 - acc: 0.912 - ETA: 2s - loss: 0.2208 - acc: 0.912 - ETA: 1s - loss: 0.2205 - acc: 0.912 - ETA: 1s - loss: 0.2220 - acc: 0.911 - ETA: 1s - loss: 0.2208 - acc: 0.912 - ETA: 0s - loss: 0.2214 - acc: 0.911 - ETA: 0s - loss: 0.2206 - acc: 0.912 - 28s 12ms/step - loss: 0.2216 - acc: 0.9125 - val_loss: 0.5212 - val_acc: 0.7850
Epoch 4/5
2400/2400 [==============================] - ETA: 24s - loss: 0.0633 - acc: 1.00 - ETA: 23s - loss: 0.1467 - acc: 0.98 - ETA: 23s - loss: 0.1292 - acc: 0.98 - ETA: 23s - loss: 0.1655 - acc: 0.97 - ETA: 23s - loss: 0.1516 - acc: 0.97 - ETA: 22s - loss: 0.1481 - acc: 0.97 - ETA: 22s - loss: 0.1418 - acc: 0.97 - ETA: 22s - loss: 0.1443 - acc: 0.96 - ETA: 21s - loss: 0.1451 - acc: 0.96 - ETA: 21s - loss: 0.1372 - acc: 0.96 - ETA: 21s - loss: 0.1299 - acc: 0.97 - ETA: 20s - loss: 0.1286 - acc: 0.97 - ETA: 20s - loss: 0.1232 - acc: 0.97 - ETA: 20s - loss: 0.1206 - acc: 0.97 - ETA: 19s - loss: 0.1183 - acc: 0.97 - ETA: 19s - loss: 0.1197 - acc: 0.97 - ETA: 19s - loss: 0.1171 - acc: 0.97 - ETA: 19s - loss: 0.1175 - acc: 0.97 - ETA: 19s - loss: 0.1209 - acc: 0.96 - ETA: 19s - loss: 0.1225 - acc: 0.96 - ETA: 18s - loss: 0.1201 - acc: 0.96 - ETA: 18s - loss: 0.1190 - acc: 0.96 - ETA: 18s - loss: 0.1233 - acc: 0.96 - ETA: 18s - loss: 0.1472 - acc: 0.95 - ETA: 18s - loss: 0.1558 - acc: 0.95 - ETA: 17s - loss: 0.1532 - acc: 0.95 - ETA: 17s - loss: 0.1559 - acc: 0.95 - ETA: 17s - loss: 0.1572 - acc: 0.95 - ETA: 17s - loss: 0.1595 - acc: 0.95 - ETA: 16s - loss: 0.1629 - acc: 0.94 - ETA: 16s - loss: 0.1600 - acc: 0.94 - ETA: 16s - loss: 0.1599 - acc: 0.94 - ETA: 15s - loss: 0.1587 - acc: 0.94 - ETA: 15s - loss: 0.1558 - acc: 0.94 - ETA: 15s - loss: 0.1531 - acc: 0.95 - ETA: 14s - loss: 0.1497 - acc: 0.95 - ETA: 14s - loss: 0.1476 - acc: 0.95 - ETA: 14s - loss: 0.1443 - acc: 0.95 - ETA: 13s - loss: 0.1450 - acc: 0.95 - ETA: 13s - loss: 0.1451 - acc: 0.95 - ETA: 13s - loss: 0.1430 - acc: 0.95 - ETA: 12s - loss: 0.1445 - acc: 0.95 - ETA: 12s - loss: 0.1444 - acc: 0.95 - ETA: 12s - loss: 0.1443 - acc: 0.95 - ETA: 11s - loss: 0.1442 - acc: 0.95 - ETA: 11s - loss: 0.1429 - acc: 0.95 - ETA: 11s - loss: 0.1423 - acc: 0.95 - ETA: 10s - loss: 0.1424 - acc: 0.95 - ETA: 10s - loss: 0.1432 - acc: 0.95 - ETA: 9s - loss: 0.1421 - acc: 0.9537 - ETA: 9s - loss: 0.1430 - acc: 0.954 - ETA: 9s - loss: 0.1410 - acc: 0.954 - ETA: 8s - loss: 0.1404 - acc: 0.955 - ETA: 8s - loss: 0.1412 - acc: 0.954 - ETA: 7s - loss: 0.1421 - acc: 0.955 - ETA: 7s - loss: 0.1422 - acc: 0.954 - ETA: 7s - loss: 0.1414 - acc: 0.954 - ETA: 6s - loss: 0.1430 - acc: 0.953 - ETA: 6s - loss: 0.1426 - acc: 0.952 - ETA: 5s - loss: 0.1421 - acc: 0.952 - ETA: 5s - loss: 0.1418 - acc: 0.952 - ETA: 5s - loss: 0.1414 - acc: 0.953 - ETA: 4s - loss: 0.1432 - acc: 0.951 - ETA: 4s - loss: 0.1435 - acc: 0.952 - ETA: 3s - loss: 0.1429 - acc: 0.951 - ETA: 3s - loss: 0.1437 - acc: 0.951 - ETA: 3s - loss: 0.1433 - acc: 0.951 - ETA: 2s - loss: 0.1420 - acc: 0.952 - ETA: 2s - loss: 0.1406 - acc: 0.952 - ETA: 1s - loss: 0.1398 - acc: 0.953 - ETA: 1s - loss: 0.1398 - acc: 0.952 - ETA: 1s - loss: 0.1407 - acc: 0.951 - ETA: 0s - loss: 0.1399 - acc: 0.951 - ETA: 0s - loss: 0.1398 - acc: 0.951 - 31s 13ms/step - loss: 0.1397 - acc: 0.9513 - val_loss: 0.6553 - val_acc: 0.7833
Epoch 5/5
2400/2400 [==============================] - ETA: 31s - loss: 0.0943 - acc: 0.96 - ETA: 31s - loss: 0.0598 - acc: 0.98 - ETA: 31s - loss: 0.0721 - acc: 0.97 - ETA: 31s - loss: 0.0689 - acc: 0.98 - ETA: 30s - loss: 0.0713 - acc: 0.98 - ETA: 30s - loss: 0.0706 - acc: 0.98 - ETA: 29s - loss: 0.0653 - acc: 0.98 - ETA: 29s - loss: 0.0619 - acc: 0.98 - ETA: 28s - loss: 0.0675 - acc: 0.98 - ETA: 27s - loss: 0.0673 - acc: 0.98 - ETA: 26s - loss: 0.0670 - acc: 0.98 - ETA: 26s - loss: 0.0719 - acc: 0.98 - ETA: 25s - loss: 0.0674 - acc: 0.98 - ETA: 24s - loss: 0.0662 - acc: 0.98 - ETA: 24s - loss: 0.0636 - acc: 0.98 - ETA: 23s - loss: 0.0608 - acc: 0.98 - ETA: 23s - loss: 0.0641 - acc: 0.98 - ETA: 23s - loss: 0.0623 - acc: 0.98 - ETA: 22s - loss: 0.0625 - acc: 0.98 - ETA: 22s - loss: 0.0648 - acc: 0.98 - ETA: 21s - loss: 0.0633 - acc: 0.98 - ETA: 21s - loss: 0.0637 - acc: 0.98 - ETA: 20s - loss: 0.0626 - acc: 0.98 - ETA: 20s - loss: 0.0639 - acc: 0.98 - ETA: 20s - loss: 0.0638 - acc: 0.98 - ETA: 19s - loss: 0.0669 - acc: 0.98 - ETA: 19s - loss: 0.0651 - acc: 0.98 - ETA: 18s - loss: 0.0661 - acc: 0.98 - ETA: 18s - loss: 0.0666 - acc: 0.98 - ETA: 17s - loss: 0.0655 - acc: 0.98 - ETA: 17s - loss: 0.0652 - acc: 0.98 - ETA: 17s - loss: 0.0653 - acc: 0.98 - ETA: 16s - loss: 0.0695 - acc: 0.98 - ETA: 16s - loss: 0.0733 - acc: 0.97 - ETA: 16s - loss: 0.0734 - acc: 0.97 - ETA: 15s - loss: 0.0720 - acc: 0.97 - ETA: 15s - loss: 0.0732 - acc: 0.97 - ETA: 14s - loss: 0.0719 - acc: 0.97 - ETA: 14s - loss: 0.0750 - acc: 0.97 - ETA: 14s - loss: 0.0734 - acc: 0.97 - ETA: 13s - loss: 0.0735 - acc: 0.97 - ETA: 13s - loss: 0.0742 - acc: 0.97 - ETA: 13s - loss: 0.0737 - acc: 0.97 - ETA: 12s - loss: 0.0727 - acc: 0.97 - ETA: 12s - loss: 0.0717 - acc: 0.97 - ETA: 11s - loss: 0.0712 - acc: 0.97 - ETA: 11s - loss: 0.0712 - acc: 0.97 - ETA: 11s - loss: 0.0739 - acc: 0.97 - ETA: 10s - loss: 0.0735 - acc: 0.97 - ETA: 10s - loss: 0.0730 - acc: 0.97 - ETA: 9s - loss: 0.0772 - acc: 0.9755 - ETA: 9s - loss: 0.0763 - acc: 0.976 - ETA: 8s - loss: 0.0765 - acc: 0.975 - ETA: 8s - loss: 0.0774 - acc: 0.974 - ETA: 8s - loss: 0.0768 - acc: 0.975 - ETA: 7s - loss: 0.0762 - acc: 0.975 - ETA: 7s - loss: 0.0767 - acc: 0.975 - ETA: 6s - loss: 0.0776 - acc: 0.974 - ETA: 6s - loss: 0.0778 - acc: 0.974 - ETA: 5s - loss: 0.0777 - acc: 0.973 - ETA: 5s - loss: 0.0774 - acc: 0.973 - ETA: 5s - loss: 0.0770 - acc: 0.973 - ETA: 4s - loss: 0.0760 - acc: 0.973 - ETA: 4s - loss: 0.0750 - acc: 0.974 - ETA: 3s - loss: 0.0760 - acc: 0.973 - ETA: 3s - loss: 0.0755 - acc: 0.974 - ETA: 3s - loss: 0.0766 - acc: 0.972 - ETA: 2s - loss: 0.0762 - acc: 0.972 - ETA: 2s - loss: 0.0762 - acc: 0.972 - ETA: 1s - loss: 0.0764 - acc: 0.972 - ETA: 1s - loss: 0.0760 - acc: 0.973 - ETA: 1s - loss: 0.0753 - acc: 0.973 - ETA: 0s - loss: 0.0754 - acc: 0.973 - ETA: 0s - loss: 0.0795 - acc: 0.973 - 30s 12ms/step - loss: 0.0796 - acc: 0.9725 - val_loss: 0.8535 - val_acc: 0.7850
In [110]:
score = modelcnn.evaluate(X_test, Y_test, batch_size=32)

print('Test Loss:', score[0])
print('Test Accuracy:', score[1])
600/600 [==============================] - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - ETA:  - 1s 2ms/step
Test Loss: 0.8535148517290752
Test Accuracy: 0.7849999992052714
In [111]:
Y_pred = modelcnn.predict(X_test)
print(Y_pred)
[[9.9848491e-01 1.5150915e-03]
 [9.8279220e-01 1.7207809e-02]
 [9.9948126e-01 5.1872607e-04]
 ...
 [1.8794717e-02 9.8120534e-01]
 [5.4257847e-02 9.4574213e-01]
 [1.5789610e-01 8.4210390e-01]]
In [112]:
y_pred = np.argmax(Y_pred, axis=1)
print(y_pred)
[0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 0
 1 1 0 0 1 1 1 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 1 0 1 0 1 1 0 1 1 1 0 1 0
 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 1 1 1
 0 1 1 0 1 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 1 1 0 1 0 1 0 1 1 1 0 0 0 0 1 0
 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0
 0 1 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0
 1 1 1 1 1 0 0 1 0 1 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 1
 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0
 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 1 1
 0 1 1 1 1 1 1 0 1 0 1 0 1 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 1 0
 1 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 1 1 0 0 1 0
 0 0 0 1 0 1 0 1 1 1 1 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 0 0
 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0
 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0
 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0
 1 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 1 0 1 1 1 1 0 0 0 1 0 0
 1 0 1 0 1 1 1 1]
In [115]:
plt.title('Accuracy')
plt.plot(hist.history['acc'], label='train')
plt.plot(hist.history['val_acc'], label='test')

plt.show();

10I. Stacking Model

In [77]:
from scipy.stats import mode
In [ ]:
stack = StackingCVClassifier(lr_pred_tv_test_clntxt,NB_pred_test,rf_pred_test, dtc_pred_test, GBM_pred_test)
stack.fit(X_train_TV_clntxt, Y_train_TV_clntxt)`
In [ ]:
stack.predict(X_train_TV_clntxt)
stack.predict(X_test_TV_clntxt)
In [ ]:
stack.predict(TV_Mat_unseen)
In [61]:
stack_test = np.array([lr_pred_tv_test_clntxt,NB_pred_test,rf_pred_test, dtc_pred_test, GBM_pred_test]).T
stack_train = np.array([lr_pred_tv_train_clntxt,NB_pred_train,rf_pred_train, dtc_pred_train, GBM_pred_train]).T
stacked_pred_train = mode(stack_train,axis=1)[0]
stacked_pred_test = mode(stack_test,axis=1)[0]
In [62]:
### Train data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,stacked_pred_train))
print("\nTrain DATA ACCURACY",accuracy_score(Y_train_TV_clntxt,stacked_pred_train))
print("\nTrain data f1-score for class '1'",f1_score(Y_train_TV_clntxt,stacked_pred_train, average='weighted'))

### Test data accuracy
print("\n\n--------------------------------------\n\n")
print("TEST DATA ACCURACY",accuracy_score(Y_test_TV_clntxt,stacked_pred_test))
print("\nTest data f1-score for class '1'",f1_score(Y_test_TV_clntxt,stacked_pred_test,pos_label='1'))
print("\nTest data f1-score for class '2'",f1_score(Y_test_TV_clntxt,stacked_pred_test,pos_label='0'))

--------------------------------------


TEST DATA ACCURACY 0.90875

Train DATA ACCURACY 0.90875

Train data f1-score for class '1' 0.9087206068130811


--------------------------------------


TEST DATA ACCURACY 0.8216666666666667

Test data f1-score for class '1' 0.834108527131783

Test data f1-score for class '2' 0.8072072072072072
In [149]:
stack_unseen = np.array([lr_pred_unseen,NB_pred_unseen,rf_pred_unseen, dtc_pred_unseen, GBM_pred_unseen]).T
In [150]:
stack_unseen= pd.DataFrame(stack_unseen)
In [151]:
stack_unseen.head(3)
Out[151]:
0 1 2 3 4
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
In [161]:
stacked_pred_unseen  = mode(stack_unseen,axis=1)[0]
In [165]:
stacked_pred_unseen.head()
Out[165]:
0
0 0
1 0
2 0
3 0
4 0
In [163]:
stacked_pred_unseen = pd.DataFrame(stacked_pred_unseen)
In [166]:
stacked_pred_unseen.to_csv("stacking_out.csv")

ModelPerformance

In [63]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\modelperformance.png")
Out[63]:
In [69]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\hakunamatata1.png")
Out[69]:

Wishing you Hakuna Matata for rest of your Days!

In [70]:
from IPython.display import display, Image
Image("D:\Data_science\PHD\Theend.png")
Out[70]: